awacke1 committed
Commit f8975bf · verified · 1 Parent(s): 358aca7

Update app.py

Files changed (1)
  app.py +89 -220
app.py CHANGED
@@ -1,4 +1,6 @@
  import streamlit as st
+ import pandas as pd
+ import random
 
  st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")
 
@@ -43,227 +45,94 @@ st.markdown("""
  1. **Communication** 🗣 \n
  - Agents exchange information \n
  2. **Cooperation** 🤝 \n
- -# 🩺🔍 Search Results
- ### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
- *Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang*
-
- In this study, our goal is to create interactive avatar agents that can
- autonomously plan and animate nuanced facial movements realistically, from both
- visual and behavioral perspectives. Given high-level inputs about the
- environment and agent profile, our framework harnesses LLMs to produce a series
- of detailed text descriptions of the avatar agents' facial motions. These
- descriptions are then processed by our task-agnostic driving engine into motion
- token sequences, which are subsequently converted into continuous motion
- embeddings that are further consumed by our standalone neural-based renderer to
- generate the final photorealistic avatar animations. These streamlined
- processes allow our framework to adapt to a variety of non-verbal avatar
- interactions, both monadic and dyadic. Our extensive study, which includes
- experiments on both newly compiled and existing datasets featuring two types of
- agents -- one capable of monadic interaction with the environment, and the
- other designed for dyadic conversation -- validates the effectiveness and
- versatility of our approach. To our knowledge, we advanced a leap step by
- combining LLMs and neural rendering for generalized non-verbal prediction and
- photo-realistic rendering of avatar agents.
-
- ---------------
-
- ### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
- *Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*
-
- Controllable image captioning is an emerging multimodal topic that aims to
- describe the image with natural language following human purpose,
- $\textit{e.g.}$, looking at the specified regions or telling in a particular
- text style. State-of-the-art methods are trained on annotated pairs of input
- controls and output captions. However, the scarcity of such well-annotated
- multimodal data largely limits their usability and scalability for interactive
- AI systems. Leveraging unimodal instruction-following foundation models is a
- promising alternative that benefits from broader sources of data. In this
- paper, we present Caption AnyThing (CAT), a foundation model augmented image
- captioning framework supporting a wide range of multimodel controls: 1) visual
- controls, including points, boxes, and trajectories; 2) language controls, such
- as sentiment, length, language, and factuality. Powered by Segment Anything
- Model (SAM) and ChatGPT, we unify the visual and language prompts into a
- modularized framework, enabling the flexible combination between different
- controls. Extensive case studies demonstrate the user intention alignment
- capabilities of our framework, shedding light on effective user interaction
- modeling in vision-language applications. Our code is publicly available at
- https://github.com/ttengwang/Caption-Anything.
-
- ---------------
-
- ### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
- *Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei*
-
- We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
- capabilities of perceiving object descriptions (e.g., bounding boxes) and
- grounding text to the visual world. Specifically, we represent refer
- expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where
- object descriptions are sequences of location tokens. Together with multimodal
- corpora, we construct large-scale data of grounded image-text pairs (called
- GrIT) to train the model. In addition to the existing capabilities of MLLMs
- (e.g., perceiving general modalities, following instructions, and performing
- in-context learning), Kosmos-2 integrates the grounding capability into
- downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
- including (i) multimodal grounding, such as referring expression comprehension,
- and phrase grounding, (ii) multimodal referring, such as referring expression
- generation, (iii) perception-language tasks, and (iv) language understanding
- and generation. This work lays out the foundation for the development of
- Embodiment AI and sheds light on the big convergence of language, multimodal
- perception, action, and world modeling, which is a key step toward artificial
- general intelligence. Code and pretrained models are available at
- https://aka.ms/kosmos-2.
-
- ---------------
-
- ### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
- *Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Crbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
-
- Screen user interfaces (UIs) and infographics, sharing similar visual
- language and design principles, play important roles in human communication and
- human-machine interaction. We introduce ScreenAI, a vision-language model that
- specializes in UI and infographics understanding. Our model improves upon the
- PaLI architecture with the flexible patching strategy of pix2struct and is
- trained on a unique mixture of datasets. At the heart of this mixture is a
- novel screen annotation task in which the model has to identify the type and
- location of UI elements. We use these text annotations to describe screens to
- Large Language Models and automatically generate question-answering (QA), UI
- navigation, and summarization training datasets at scale. We run ablation
- studies to demonstrate the impact of these design choices. At only 5B
- parameters, ScreenAI achieves new state-of-the-artresults on UI- and
- infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
- Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
- InfographicVQA) compared to models of similar size. Finally, we release three
- new datasets: one focused on the screen annotation task and two others focused
- on question answering.
-
- ---------------
-
- ### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
- *Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu*
-
- Task-oriented conversational agents rely on semantic parsers to translate
- natural language to formal representations. In this paper, we propose the
- design and rationale of the ThingTalk formal representation, and how the design
- improves the development of transactional task-oriented agents.
- ThingTalk is built on four core principles: (1) representing user requests
- directly as executable statements, covering all the functionality of the agent,
- (2) representing dialogues formally and succinctly to support accurate
- contextual semantic parsing, (3) standardizing types and interfaces to maximize
- reuse between agents, and (4) allowing multiple, independently-developed agents
- to be composed in a single virtual assistant. ThingTalk is developed as part of
- the Genie Framework that allows developers to quickly build transactional
- agents given a database and APIs.
- We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
- Compared to the others, the ThingTalk design is both more general and more
- cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
- associated tools yields a new state of the art accuracy of 79% turn-by-turn.
-
- ---------------
-
- ### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
- *Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould*
-
- In the pursuit of efficient automated content creation, procedural
- generation, leveraging modifiable parameters and rule-based systems, emerges as
- a promising approach. Nonetheless, it could be a demanding endeavor, given its
- intricate nature necessitating a deep understanding of rules, algorithms, and
- parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
- large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT
- positions LLMs as proficient problem solvers, dissecting the procedural 3D
- modeling tasks into accessible segments and appointing the apt agent for each
- task. 3D-GPT integrates three core agents: the task dispatch agent, the
- conceptualization agent, and the modeling agent. They collaboratively achieve
- two objectives. First, it enhances concise initial scene descriptions, evolving
- them into detailed forms while dynamically adapting the text based on
- subsequent instructions. Second, it integrates procedural generation,
- extracting parameter values from enriched text to effortlessly interface with
- 3D software for asset creation. Our empirical investigations confirm that
- 3D-GPT not only interprets and executes instructions, delivering reliable
- results but also collaborates effectively with human designers. Furthermore, it
- seamlessly integrates with Blender, unlocking expanded manipulation
- possibilities. Our work highlights the potential of LLMs in 3D modeling,
- offering a basic framework for future advancements in scene generation and
- animation.
-
- ---------------
-
- ### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
- *Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan*
-
- Equipping embodied agents with commonsense is important for robots to
- successfully complete complex human instructions in general environments.
- Recent large language models (LLM) can embed rich semantic knowledge for agents
- in plan generation of complex tasks, while they lack the information about the
- realistic world and usually yield infeasible action sequences. In this paper,
- we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
- with physical scene constraint, where the agent generates executable plans
- according to the existed objects in the scene by aligning LLMs with the visual
- perception models. Specifically, we first construct a multimodal dataset
- containing triplets of indoor scenes, instructions and action plans, where we
- provide the designed prompts and the list of existing objects in the scene for
- GPT-3.5 to generate a large number of instructions and corresponding planned
- actions. The generated data is leveraged for grounded plan tuning of
- pre-trained LLMs. During inference, we discover the objects in the scene by
- extending open-vocabulary object detectors to multi-view RGB images collected
- in different achievable locations. Experimental results show that the generated
- plan from our TaPA framework can achieve higher success rate than LLaVA and
- GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
- planning in general and complex environments.
-
- ---------------
-
- ### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
- *Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*
-
- Recent advancements in vision-language pre-training (e.g. CLIP) have shown
- that vision models can benefit from language supervision. While many models
- using language modality have achieved great success on 2D vision tasks, the
- joint representation learning of 3D point cloud with text remains
- under-explored due to the difficulty of 3D-Text data pair acquisition and the
- irregularity of 3D data structure. In this paper, we propose a novel Text4Point
- framework to construct language-guided 3D point cloud models. The key idea is
- utilizing 2D images as a bridge to connect the point cloud and the language
- modalities. The proposed Text4Point follows the pre-training and fine-tuning
- paradigm. During the pre-training stage, we establish the correspondence of
- images and point clouds based on the readily available RGB-D data and use
- contrastive learning to align the image and point cloud representations.
- Together with the well-aligned image and text features achieved by CLIP, the
- point cloud features are implicitly aligned with the text embeddings. Further,
- we propose a Text Querying Module to integrate language information into 3D
- representation learning by querying text embeddings with point cloud features.
- For fine-tuning, the model learns task-specific 3D representations under
- informative language guidance from the label set without 2D images. Extensive
- experiments demonstrate that our model shows consistent improvement on various
- downstream tasks, such as point cloud semantic segmentation, instance
- segmentation, and object detection. The code will be available here:
- https://github.com/LeapLabTHU/Text4Point
 
- ---------------
 
- ### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
- *Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji*
 
- Large Language Model (LLM) agents, capable of performing a broad range of
- actions, such as invoking tools and controlling robots, show great potential in
- tackling real-world challenges. LLM agents are typically prompted to produce
- actions by generating JSON or text in a pre-defined format, which is usually
- limited by constrained action space (e.g., the scope of pre-defined tools) and
- restricted flexibility (e.g., inability to compose multiple tools). This work
- proposes to use executable Python code to consolidate LLM agents' actions into
- a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
- can execute code actions and dynamically revise prior actions or emit new
- actions upon new observations through multi-turn interactions. Our extensive
- analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
- CodeAct outperforms widely used alternatives (up to 20% higher success rate).
- The encouraging performance of CodeAct motivates us to build an open-source LLM
- agent that interacts with environments by executing interpretable code and
- collaborates with users using natural language. To this end, we collect an
- instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
- interactions using CodeAct. We show that it can be used with existing data to
- improve models in agent-oriented tasks without compromising their general
- capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
- Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
- model training) using existing libraries and autonomously self-debug.
 
- ---------------
- """)
+ - Agents work together to achieve common goals \n
+ """)
+
+ st.markdown("---")
+
+ papers = [
+ {
+ "date": "04 Dec 2023",
+ "title": "AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents",
+ "link": "https://arxiv.org/pdf/2311.17465",
+ "summary": "In this study, our goal is to create interactive avatar agents that can autonomously plan and animate nuanced facial movements realistically, from both visual and behavioral perspectives. Given high-level inputs about the environment and agent profile, our framework harnesses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions. These descriptions are then processed by our task-agnostic driving engine into motion token sequences, which are subsequently converted into continuous motion embeddings that are further consumed by our standalone neural-based renderer to generate the final photorealistic avatar animations. These streamlined processes allow our framework to adapt to a variety of non-verbal avatar interactions, both monadic and dyadic. Our extensive study, which includes experiments on both newly compiled and existing datasets featuring two types of agents -- one capable of monadic interaction with the environment, and the other designed for dyadic conversation -- validates the effectiveness and versatility of our approach. To our knowledge, we advanced a leap step by combining LLMs and neural rendering for generalized non-verbal prediction and photo-realistic rendering of avatar agents."
+ },
+ {
+ "date": "06 Jul 2023",
+ "title": "Caption Anything: Interactive Image Description with Diverse Multimodal Controls",
+ "link": "https://arxiv.org/pdf/2305.02677",
+ "summary": "Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose, $\textit{e.g.}$, looking at the specified regions or telling in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation model augmented image captioning framework supporting a wide range of multimodel controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling the flexible combination between different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything."
+ },
+ {
+ "date": "13 Jul 2023",
+ "title": "Kosmos-2: Grounding Multimodal Large Language Models to the World",
+ "link": "https://arxiv.org/pdf/2306.14824",
+ "summary": "We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2."
+ },
+ {
+ "date": "19 Feb 2024",
+ "title": "ScreenAI: A Vision-Language Model for UI and Infographics Understanding",
+ "link": "https://arxiv.org/pdf/2402.04615",
+ "summary": "Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-artresults on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering."
+ },
+ {
+ "date": "23 Mar 2022",
+ "title": "ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues",
+ "link": "https://arxiv.org/pdf/2203.12751",
+ "summary": "Task-oriented conversational agents rely on semantic parsers to translate natural language to formal representations. In this paper, we propose the design and rationale of the ThingTalk formal representation, and how the design improves the development of transactional task-oriented agents. ThingTalk is built on four core principles: (1) representing user requests directly as executable statements, covering all the functionality of the agent, (2) representing dialogues formally and succinctly to support accurate contextual semantic parsing, (3) standardizing types and interfaces to maximize reuse between agents, and (4) allowing multiple, independently-developed agents to be composed in a single virtual assistant. ThingTalk is developed as part of the Genie Framework that allows developers to quickly build transactional agents given a database and APIs. We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST. Compared to the others, the ThingTalk design is both more general and more cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and associated tools yields a new state of the art accuracy of 79% turn-by-turn."
+ },
+ {
+ "date": "19 Oct 2023",
+ "title": "3D-GPT: Procedural 3D Modeling with Large Language Models",
+ "link": "https://arxiv.org/pdf/2310.12945",
+ "summary": "In the pursuit of efficient automated content creation, procedural generation, leveraging modifiable parameters and rule-based systems, emerges as a promising approach. Nonetheless, it could be a demanding endeavor, given its intricate nature necessitating a deep understanding of rules, algorithms, and parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task. 3D-GPT integrates three core agents: the task dispatch agent, the conceptualization agent, and the modeling agent. They collaboratively achieve two objectives. First, it enhances concise initial scene descriptions, evolving them into detailed forms while dynamically adapting the text based on subsequent instructions. Second, it integrates procedural generation, extracting parameter values from enriched text to effortlessly interface with 3D software for asset creation. Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results but also collaborates effectively with human designers. Furthermore, it seamlessly integrates with Blender, unlocking expanded manipulation possibilities. Our work highlights the potential of LLMs in 3D modeling, offering a basic framework for future advancements in scene generation and animation."
+ },
+ {
+ "date": "04 Jul 2023",
+ "title": "Embodied Task Planning with Large Language Models",
+ "link": "https://arxiv.org/pdf/2307.01848",
+ "summary": "Equipping embodied agents with commonsense is important for robots to successfully complete complex human instructions in general environments. Recent large language models (LLM) can embed rich semantic knowledge for agents in plan generation of complex tasks, while they lack the information about the realistic world and usually yield infeasible action sequences. In this paper, we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint, where the agent generates executable plans according to the existed objects in the scene by aligning LLMs with the visual perception models. Specifically, we first construct a multimodal dataset containing triplets of indoor scenes, instructions and action plans, where we provide the designed prompts and the list of existing objects in the scene for GPT-3.5 to generate a large number of instructions and corresponding planned actions. The generated data is leveraged for grounded plan tuning of pre-trained LLMs. During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations. Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin, which indicates the practicality of embodied task planning in general and complex environments."
+ },
+ {
+ "date": "18 Jan 2023",
+ "title": "Joint Representation Learning for Text and 3D Point Cloud",
+ "link": "https://arxiv.org/pdf/2301.07584",
+ "summary": "Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point cloud with text remains under-explored due to the difficulty of 3D-Text data pair acquisition and the irregularity of 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is utilizing 2D images as a bridge to connect the point cloud and the language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations. Together with the well-aligned image and text features achieved by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns task-specific 3D representations under informative language guidance from the label set without 2D images. Extensive experiments demonstrate that our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection. The code will be available here: https://github.com/LeapLabTHU/Text4Point"
+ },
+ {
+ "date": "01 Feb 2024",
+ "title": "Executable Code Actions Elicit Better LLM Agents",
+ "link": "https://arxiv.org/pdf/2402.01030",
+ "summary": "Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomously self-debug."
+ }
+ ]
+
+ df = pd.DataFrame(papers)
+
+ num_columns = st.slider("Number of Columns", min_value=1, max_value=4, value=2)
+
+ for i in range(0, len(df), num_columns):
+     cols = st.columns(num_columns)
+     for j in range(num_columns):
+         if i + j < len(df):
+             paper = df.iloc[i + j]
+             with cols[j]:
+                 st.markdown(f"<div style='border: 1px solid #ccc; padding: 10px; border-radius: 5px;'><h4>{paper['title']}</h4><p>{paper['date']}</p><p>{paper['summary']}</p><a href='{paper['link']}'>PDF</a></div>", unsafe_allow_html=True)
+
+ svg_code = f"""
+ <svg width="100%" height="200">
+ <defs>
+ <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="0" refY="3.5" orient="auto">
+ <polygon points="0 0, 10 3.5, 0 7" fill="#000"/>
+ </marker>
+ </defs>
+ <line x1="0" y1="{random.randint(50, 150)}" x2="100%" y2="{random.randint(50, 150)}" stroke="#000" stroke-width="2" marker-end="url(#arrowhead)">
+ <animate attributeName="x1" from="0" to="100%" dur="{random.randint(3, 6)}s" repeatCount="indefinite"/>
+ <animate attributeName="x2" from="100%" to="0" dur="{random.randint(3, 6)}s" repeatCount="indefinite"/>
+ </line>
+ </svg>
+ """
+
+ st.markdown(svg_code, unsafe_allow_html=True)
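
The main layout change in this commit is easy to miss inside the long data block: the paper dicts are loaded into a pandas DataFrame and rendered as HTML cards, num_columns at a time, by walking the row index in strides. A minimal standalone sketch of that same pattern follows; the two-entry items list is illustrative placeholder data, not part of the commit, which uses the nine arXiv paper dicts above.

import pandas as pd
import streamlit as st

# Placeholder data for illustration only.
items = [
    {"date": "01 Jan 2024", "title": "Paper A", "link": "https://example.org/a.pdf", "summary": "Example summary A."},
    {"date": "02 Jan 2024", "title": "Paper B", "link": "https://example.org/b.pdf", "summary": "Example summary B."},
]
df = pd.DataFrame(items)

num_columns = st.slider("Number of Columns", min_value=1, max_value=4, value=2)

# Step through the rows in strides of num_columns; each stride becomes one row of cards.
for i in range(0, len(df), num_columns):
    cols = st.columns(num_columns)
    for j in range(num_columns):
        if i + j < len(df):  # the final stride may be only partially filled
            paper = df.iloc[i + j]
            with cols[j]:
                st.markdown(
                    f"<div style='border: 1px solid #ccc; padding: 10px; border-radius: 5px;'>"
                    f"<h4>{paper['title']}</h4><p>{paper['date']}</p>"
                    f"<p>{paper['summary']}</p><a href='{paper['link']}'>PDF</a></div>",
                    unsafe_allow_html=True,
                )

Run with streamlit run app.py; moving the slider reruns the script and reflows the grid.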
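The animated divider at the end of the new script is plain SVG passed through st.markdown(..., unsafe_allow_html=True). One point worth noting: the random.randint calls sit inside an f-string, so the line endpoints and durations are fixed once per script rerun on the Python side, while the SVG animate elements keep looping client-side in the browser. A reduced sketch of the same idea, with the commit's arrowhead marker omitted for brevity and a single duration reused for both animations (the commit draws each value independently):

import random
import streamlit as st

# Chosen once per rerun, when the f-string is evaluated.
y1, y2 = random.randint(50, 150), random.randint(50, 150)
dur = random.randint(3, 6)

svg_code = f"""
<svg width="100%" height="200">
  <line x1="0" y1="{y1}" x2="100%" y2="{y2}" stroke="#000" stroke-width="2">
    <!-- The looping animation runs in the browser, independent of Streamlit reruns. -->
    <animate attributeName="x1" from="0" to="100%" dur="{dur}s" repeatCount="indefinite"/>
    <animate attributeName="x2" from="100%" to="0" dur="{dur}s" repeatCount="indefinite"/>
  </line>
</svg>
"""
st.markdown(svg_code, unsafe_allow_html=True)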