awacke1 committed on
Commit 8d8a8b9 · verified · 1 Parent(s): 9e832a1

Create app.py

Files changed (1)
  1. app.py +744 -0
app.py ADDED
@@ -0,0 +1,744 @@
1
+ import streamlit as st
2
+
3
+ st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")
4
+
5
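+ # Inject a small CSS snippet to hide Streamlit's default hamburger menu and footer.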
+ hide_streamlit_style = """
6
+ <style>
7
+ #MainMenu {visibility: hidden;}
8
+ footer {visibility: hidden;}
9
+ </style>
10
+ """
11
+ st.markdown(hide_streamlit_style, unsafe_allow_html=True)
12
+
13
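+ # Lay the overview out in two side-by-side columns.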
+ col1, col2 = st.columns(2)
14
+
15
+ with col1:
16
+ st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:")
17
+ st.markdown("### **Key Aspects** :bulb:")
18
+ st.markdown("""
19
+ 1. **Interaction Protocol** 🤝 \n
20
+ - Define rules for communication and cooperation \n
21
+ 2. **Decentralized Decision Making** 🎯 \n
22
+ - Autonomous agents make independent decisions \n
23
+ 3. **Collaboration and Competition** 🤼 \n
24
+ - Agents work together or against each other \n
25
+ """)
26
+
27
+ with col2:
28
+ st.markdown("### **Entities** :guardsman:")
29
+ st.markdown("""
30
+ 1. **Autonomous Agents** 🤖 \n
31
+ - Independent entities with decision-making capabilities \n
32
+ 2. **Environment** 🌐 \n
33
+ - Shared space where agents interact \n
34
+ 3. **Ruleset** 📜 \n
35
+ - Defines interaction protocol and decision-making processes \n
36
+ """)
37
+
38
+ st.markdown("---")
39
+
40
+ st.markdown("## **Interaction Protocol** 🤝 :bulb:")
41
+ st.markdown("### **Key Elements** :guardsman:")
42
+ st.markdown("""
43
+ 1. **Communication** 🗣 \n
44
+ - Agents exchange information \n
45
+ 2. **Cooperation** 🤝 \n
46
+ # 🩺🔍 Search Results
47
+ ### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
48
+ *Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang*
49
+
50
+ In this study, our goal is to create interactive avatar agents that can
51
+ autonomously plan and animate nuanced facial movements realistically, from both
52
+ visual and behavioral perspectives. Given high-level inputs about the
53
+ environment and agent profile, our framework harnesses LLMs to produce a series
54
+ of detailed text descriptions of the avatar agents' facial motions. These
55
+ descriptions are then processed by our task-agnostic driving engine into motion
56
+ token sequences, which are subsequently converted into continuous motion
57
+ embeddings that are further consumed by our standalone neural-based renderer to
58
+ generate the final photorealistic avatar animations. These streamlined
59
+ processes allow our framework to adapt to a variety of non-verbal avatar
60
+ interactions, both monadic and dyadic. Our extensive study, which includes
61
+ experiments on both newly compiled and existing datasets featuring two types of
62
+ agents -- one capable of monadic interaction with the environment, and the
63
+ other designed for dyadic conversation -- validates the effectiveness and
64
+ versatility of our approach. To our knowledge, we advanced a leap step by
65
+ combining LLMs and neural rendering for generalized non-verbal prediction and
66
+ photo-realistic rendering of avatar agents.
67
+
68
+ ---------------
69
+
70
+ ### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
71
+ *Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*
72
+
73
+ Controllable image captioning is an emerging multimodal topic that aims to
74
+ describe the image with natural language following human purpose,
75
+ $\textit{e.g.}$, looking at the specified regions or telling in a particular
76
+ text style. State-of-the-art methods are trained on annotated pairs of input
77
+ controls and output captions. However, the scarcity of such well-annotated
78
+ multimodal data largely limits their usability and scalability for interactive
79
+ AI systems. Leveraging unimodal instruction-following foundation models is a
80
+ promising alternative that benefits from broader sources of data. In this
81
+ paper, we present Caption AnyThing (CAT), a foundation model augmented image
82
+ captioning framework supporting a wide range of multimodel controls: 1) visual
83
+ controls, including points, boxes, and trajectories; 2) language controls, such
84
+ as sentiment, length, language, and factuality. Powered by Segment Anything
85
+ Model (SAM) and ChatGPT, we unify the visual and language prompts into a
86
+ modularized framework, enabling the flexible combination between different
87
+ controls. Extensive case studies demonstrate the user intention alignment
88
+ capabilities of our framework, shedding light on effective user interaction
89
+ modeling in vision-language applications. Our code is publicly available at
90
+ https://github.com/ttengwang/Caption-Anything.
91
+
92
+ ---------------
93
+
94
+ ### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
95
+ *Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei*
96
+
97
+ We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
98
+ capabilities of perceiving object descriptions (e.g., bounding boxes) and
99
+ grounding text to the visual world. Specifically, we represent refer
100
+ expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where
101
+ object descriptions are sequences of location tokens. Together with multimodal
102
+ corpora, we construct large-scale data of grounded image-text pairs (called
103
+ GrIT) to train the model. In addition to the existing capabilities of MLLMs
104
+ (e.g., perceiving general modalities, following instructions, and performing
105
+ in-context learning), Kosmos-2 integrates the grounding capability into
106
+ downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
107
+ including (i) multimodal grounding, such as referring expression comprehension,
108
+ and phrase grounding, (ii) multimodal referring, such as referring expression
109
+ generation, (iii) perception-language tasks, and (iv) language understanding
110
+ and generation. This work lays out the foundation for the development of
111
+ Embodiment AI and sheds light on the big convergence of language, multimodal
112
+ perception, action, and world modeling, which is a key step toward artificial
113
+ general intelligence. Code and pretrained models are available at
114
+ https://aka.ms/kosmos-2.
115
+
116
+ ---------------
117
+
118
+ ### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
119
+ *Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
120
+
121
+ Screen user interfaces (UIs) and infographics, sharing similar visual
122
+ language and design principles, play important roles in human communication and
123
+ human-machine interaction. We introduce ScreenAI, a vision-language model that
124
+ specializes in UI and infographics understanding. Our model improves upon the
125
+ PaLI architecture with the flexible patching strategy of pix2struct and is
126
+ trained on a unique mixture of datasets. At the heart of this mixture is a
127
+ novel screen annotation task in which the model has to identify the type and
128
+ location of UI elements. We use these text annotations to describe screens to
129
+ Large Language Models and automatically generate question-answering (QA), UI
130
+ navigation, and summarization training datasets at scale. We run ablation
131
+ studies to demonstrate the impact of these design choices. At only 5B
132
+ parameters, ScreenAI achieves new state-of-the-art results on UI- and
133
+ infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
134
+ Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
135
+ InfographicVQA) compared to models of similar size. Finally, we release three
136
+ new datasets: one focused on the screen annotation task and two others focused
137
+ on question answering.
138
+
139
+ ---------------
140
+
141
+ ### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
142
+ *Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu*
143
+
144
+ Task-oriented conversational agents rely on semantic parsers to translate
145
+ natural language to formal representations. In this paper, we propose the
146
+ design and rationale of the ThingTalk formal representation, and how the design
147
+ improves the development of transactional task-oriented agents.
148
+ ThingTalk is built on four core principles: (1) representing user requests
149
+ directly as executable statements, covering all the functionality of the agent,
150
+ (2) representing dialogues formally and succinctly to support accurate
151
+ contextual semantic parsing, (3) standardizing types and interfaces to maximize
152
+ reuse between agents, and (4) allowing multiple, independently-developed agents
153
+ to be composed in a single virtual assistant. ThingTalk is developed as part of
154
+ the Genie Framework that allows developers to quickly build transactional
155
+ agents given a database and APIs.
156
+ We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
157
+ Compared to the others, the ThingTalk design is both more general and more
158
+ cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
159
+ associated tools yields a new state of the art accuracy of 79% turn-by-turn.
160
+
161
+ ---------------
162
+
163
+ ### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
164
+ *Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould*
165
+
166
+ In the pursuit of efficient automated content creation, procedural
167
+ generation, leveraging modifiable parameters and rule-based systems, emerges as
168
+ a promising approach. Nonetheless, it could be a demanding endeavor, given its
169
+ intricate nature necessitating a deep understanding of rules, algorithms, and
170
+ parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
171
+ large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT
172
+ positions LLMs as proficient problem solvers, dissecting the procedural 3D
173
+ modeling tasks into accessible segments and appointing the apt agent for each
174
+ task. 3D-GPT integrates three core agents: the task dispatch agent, the
175
+ conceptualization agent, and the modeling agent. They collaboratively achieve
176
+ two objectives. First, it enhances concise initial scene descriptions, evolving
177
+ them into detailed forms while dynamically adapting the text based on
178
+ subsequent instructions. Second, it integrates procedural generation,
179
+ extracting parameter values from enriched text to effortlessly interface with
180
+ 3D software for asset creation. Our empirical investigations confirm that
181
+ 3D-GPT not only interprets and executes instructions, delivering reliable
182
+ results but also collaborates effectively with human designers. Furthermore, it
183
+ seamlessly integrates with Blender, unlocking expanded manipulation
184
+ possibilities. Our work highlights the potential of LLMs in 3D modeling,
185
+ offering a basic framework for future advancements in scene generation and
186
+ animation.
187
+
188
+ ---------------
189
+
190
+ ### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
191
+ *Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan*
192
+
193
+ Equipping embodied agents with commonsense is important for robots to
194
+ successfully complete complex human instructions in general environments.
195
+ Recent large language models (LLM) can embed rich semantic knowledge for agents
196
+ in plan generation of complex tasks, while they lack the information about the
197
+ realistic world and usually yield infeasible action sequences. In this paper,
198
+ we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
199
+ with physical scene constraint, where the agent generates executable plans
200
+ according to the existed objects in the scene by aligning LLMs with the visual
201
+ perception models. Specifically, we first construct a multimodal dataset
202
+ containing triplets of indoor scenes, instructions and action plans, where we
203
+ provide the designed prompts and the list of existing objects in the scene for
204
+ GPT-3.5 to generate a large number of instructions and corresponding planned
205
+ actions. The generated data is leveraged for grounded plan tuning of
206
+ pre-trained LLMs. During inference, we discover the objects in the scene by
207
+ extending open-vocabulary object detectors to multi-view RGB images collected
208
+ in different achievable locations. Experimental results show that the generated
209
+ plan from our TaPA framework can achieve higher success rate than LLaVA and
210
+ GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
211
+ planning in general and complex environments.
212
+
213
+ ---------------
214
+
215
+ ### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
216
+ *Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*
217
+
218
+ Recent advancements in vision-language pre-training (e.g. CLIP) have shown
219
+ that vision models can benefit from language supervision. While many models
220
+ using language modality have achieved great success on 2D vision tasks, the
221
+ joint representation learning of 3D point cloud with text remains
222
+ under-explored due to the difficulty of 3D-Text data pair acquisition and the
223
+ irregularity of 3D data structure. In this paper, we propose a novel Text4Point
224
+ framework to construct language-guided 3D point cloud models. The key idea is
225
+ utilizing 2D images as a bridge to connect the point cloud and the language
226
+ modalities. The proposed Text4Point follows the pre-training and fine-tuning
227
+ paradigm. During the pre-training stage, we establish the correspondence of
228
+ images and point clouds based on the readily available RGB-D data and use
229
+ contrastive learning to align the image and point cloud representations.
230
+ Together with the well-aligned image and text features achieved by CLIP, the
231
+ point cloud features are implicitly aligned with the text embeddings. Further,
232
+ we propose a Text Querying Module to integrate language information into 3D
233
+ representation learning by querying text embeddings with point cloud features.
234
+ For fine-tuning, the model learns task-specific 3D representations under
235
+ informative language guidance from the label set without 2D images. Extensive
236
+ experiments demonstrate that our model shows consistent improvement on various
237
+ downstream tasks, such as point cloud semantic segmentation, instance
238
+ segmentation, and object detection. The code will be available here:
239
+ https://github.com/LeapLabTHU/Text4Point
240
+
241
+ ---------------
242
+
243
+ ### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
244
+ *Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji*
245
+
246
+ Large Language Model (LLM) agents, capable of performing a broad range of
247
+ actions, such as invoking tools and controlling robots, show great potential in
248
+ tackling real-world challenges. LLM agents are typically prompted to produce
249
+ actions by generating JSON or text in a pre-defined format, which is usually
250
+ limited by constrained action space (e.g., the scope of pre-defined tools) and
251
+ restricted flexibility (e.g., inability to compose multiple tools). This work
252
+ proposes to use executable Python code to consolidate LLM agents' actions into
253
+ a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
254
+ can execute code actions and dynamically revise prior actions or emit new
255
+ actions upon new observations through multi-turn interactions. Our extensive
256
+ analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
257
+ CodeAct outperforms widely used alternatives (up to 20% higher success rate).
258
+ The encouraging performance of CodeAct motivates us to build an open-source LLM
259
+ agent that interacts with environments by executing interpretable code and
260
+ collaborates with users using natural language. To this end, we collect an
261
+ instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
262
+ interactions using CodeAct. We show that it can be used with existing data to
263
+ improve models in agent-oriented tasks without compromising their general
264
+ capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
265
+ Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
266
+ model training) using existing libraries and autonomously self-debug.
267
+
268
+ ---------------
269
+
270
+ ### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
271
+ *Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried*
272
+
273
+ Autonomous agents capable of planning, reasoning, and executing actions on
274
+ the web offer a promising avenue for automating computer tasks. However, the
275
+ majority of existing benchmarks primarily focus on text-based agents,
276
+ neglecting many natural tasks that require visual information to effectively
277
+ solve. Given that most computer interfaces cater to human perception, visual
278
+ information often augments textual data in ways that text-only models struggle
279
+ to harness effectively. To bridge this gap, we introduce VisualWebArena, a
280
+ benchmark designed to assess the performance of multimodal web agents on
281
+ realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set
282
+ of diverse and complex web-based tasks that evaluate various capabilities of
283
+ autonomous multimodal agents. To perform on this benchmark, agents need to
284
+ accurately process image-text inputs, interpret natural language instructions,
285
+ and execute actions on websites to accomplish user-defined objectives. We
286
+ conduct an extensive evaluation of state-of-the-art LLM-based autonomous
287
+ agents, including several multimodal models. Through extensive quantitative and
288
+ qualitative analysis, we identify several limitations of text-only LLM agents,
289
+ and reveal gaps in the capabilities of state-of-the-art multimodal language
290
+ agents. VisualWebArena provides a framework for evaluating multimodal
291
+ autonomous language agents, and offers insights towards building stronger
292
+ autonomous agents for the web. Our code, baseline models, and data is publicly
293
+ available at https://jykoh.com/vwa.
294
+
295
+ ---------------
296
+
297
+ ### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
298
+ *Seungwhan Moon, Leonardo Neves, Vitor Carvalho*
299
+
300
+ We introduce a new task called Multimodal Named Entity Recognition (MNER) for
301
+ noisy user-generated data such as tweets or Snapchat captions, which comprise
302
+ short text with accompanying images. These social media posts often come in
303
+ inconsistent or incomplete syntax and lexical notations with very limited
304
+ surrounding textual contexts, bringing significant challenges for NER. To this
305
+ end, we create a new dataset for MNER called SnapCaptions (Snapchat
306
+ image-caption pairs submitted to public and crowd-sourced stories with fully
307
+ annotated named entities). We then build upon the state-of-the-art Bi-LSTM
308
+ word/character based NER models with 1) a deep image network which incorporates
309
+ relevant visual context to augment textual information, and 2) a generic
310
+ modality-attention module which learns to attenuate irrelevant modalities while
311
+ amplifying the most informative ones to extract contexts from, adaptive to each
312
+ sample and token. The proposed MNER model with modality attention significantly
313
+ outperforms the state-of-the-art text-only NER models by successfully
314
+ leveraging provided visual contexts, opening up potential applications of MNER
315
+ on myriads of social media platforms.
316
+
317
+ ---------------
318
+
319
+ ### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
320
+ *Zhuosheng Zhang, Aston Zhang*
321
+
322
+ Autonomous user interface (UI) agents aim to facilitate task automation by
323
+ interacting with the user interface without manual intervention. Recent studies
324
+ have investigated eliciting the capabilities of large language models (LLMs)
325
+ for effective engagement in diverse environments. To align with the
326
+ input-output requirement of LLMs, existing approaches are developed under a
327
+ sandbox setting where they rely on external tools and application-specific APIs
328
+ to parse the environment into textual elements and interpret the predicted
329
+ actions. Consequently, those approaches often grapple with inference
330
+ inefficiency and error propagation risks. To mitigate the challenges, we
331
+ introduce Auto-UI, a multimodal solution that directly interacts with the
332
+ interface, bypassing the need for environment parsing or reliance on
333
+ application-dependent APIs. Moreover, we propose a chain-of-action technique --
334
+ leveraging a series of intermediate previous action histories and future action
335
+ plans -- to help the agent decide what action to execute. We evaluate our
336
+ approach on a new device-control benchmark AITW with 30K unique instructions,
337
+ spanning multi-step tasks such as application operation, web searching, and web
338
+ shopping. Experimental results show that Auto-UI achieves state-of-the-art
339
+ performance with an action type prediction accuracy of 90% and an overall
340
+ action success rate of 74%. Code is publicly available at
341
+ https://github.com/cooelf/Auto-UI.
342
+
343
+ ---------------
344
+
345
+ ### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
346
+ *Victor Dibia*
347
+
348
+ Systems that support users in the automatic creation of visualizations must
349
+ address several subtasks - understand the semantics of data, enumerate relevant
350
+ visualization goals and generate visualization specifications. In this work, we
351
+ pose visualization generation as a multi-stage generation problem and argue
352
+ that well-orchestrated pipelines based on large language models (LLMs) such as
353
+ ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing
354
+ these tasks. We present LIDA, a novel tool for generating grammar-agnostic
355
+ visualizations and infographics. LIDA comprises of 4 modules - A SUMMARIZER
356
+ that converts data into a rich but compact natural language summary, a GOAL
357
+ EXPLORER that enumerates visualization goals given the data, a VISGENERATOR
358
+ that generates, refines, executes and filters visualization code and an
359
+ INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA
360
+ provides a python api, and a hybrid user interface (direct manipulation and
361
+ multilingual natural language) for interactive chart, infographics and data
362
+ story generation. Learn more about the project here -
363
+ https://microsoft.github.io/lida/
364
+
365
+ ---------------
366
+
367
+ ### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
368
+ *Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le*
369
+
370
+ Video paragraph captioning aims to generate a multi-sentence description of
371
+ an untrimmed video with several temporal event locations in coherent
372
+ storytelling. Following the human perception process, where the scene is
373
+ effectively understood by decomposing it into visual (e.g. human, animal) and
374
+ non-visual components (e.g. action, relations) under the mutual influence of
375
+ vision and language, we first propose a visual-linguistic (VL) feature. In the
376
+ proposed VL feature, the scene is modeled by three modalities including (i) a
377
+ global visual environment; (ii) local visual main agents; (iii) linguistic
378
+ scene elements. We then introduce an autoregressive Transformer-in-Transformer
379
+ (TinT) to simultaneously capture the semantic coherence of intra- and
380
+ inter-event contents within a video. Finally, we present a new VL contrastive
381
+ loss function to guarantee learnt embedding features are matched with the
382
+ captions semantics. Comprehensive experiments and extensive ablation studies on
383
+ ActivityNet Captions and YouCookII datasets show that the proposed
384
+ Visual-Linguistic Transformer-in-Transform (VLTinT) outperforms prior
385
+ state-of-the-art methods on accuracy and diversity. Source code is made
386
+ publicly available at: https://github.com/UARK-AICV/VLTinT.
387
+
388
+ ---------------
389
+
390
+ ### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
391
+ *Samuel Mascarenhas, Manuel Guimarães, Pedro A. Santos, João Dias, Rui Prada, Ana Paiva*
392
+
393
+ More than a decade has passed since the development of FearNot!, an
394
+ application designed to help children deal with bullying through role-playing
395
+ with virtual characters. It was also the application that led to the creation
396
+ of FAtiMA, an affective agent architecture for creating autonomous characters
397
+ that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a
398
+ collection of open-source tools that is designed to help researchers, game
399
+ developers and roboticists incorporate a computational model of emotion and
400
+ decision-making in their work. The toolkit was developed with the goal of
401
+ making FAtiMA more accessible, easier to incorporate into different projects
402
+ and more flexible in its capabilities for human-agent interaction, based upon
403
+ the experience gathered over the years across different virtual environments
404
+ and human-robot interaction scenarios. As a result, this work makes several
405
+ different contributions to the field of Agent-Based Architectures. More
406
+ precisely, FAtiMA Toolkit's library based design allows developers to easily
407
+ integrate it with other frameworks, its meta-cognitive model affords different
408
+ internal reasoners and affective components and its explicit dialogue structure
409
+ gives control to the author even within highly complex scenarios. To
410
+ demonstrate the use of FAtiMA Toolkit, several different use cases where the
411
+ toolkit was successfully applied are described and discussed.
412
+
413
+ ---------------
414
+
415
+ ### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
416
+ *Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri*
417
+
418
+ In the absence of nonverbal cues during messaging communication, users
419
+ express part of their emotions using emojis. Thus, having emojis in the
420
+ vocabulary of text messaging language models can significantly improve many
421
+ natural language processing (NLP) applications such as online communication
422
+ analysis. On the other hand, word embedding models are usually trained on a
423
+ very large corpus of text such as Wikipedia or Google News datasets that
424
+ include very few samples with emojis. In this study, we create emojiSpace,
425
+ which is a combined word-emoji embedding using the word2vec model from the
426
+ Genism library in Python. We trained emojiSpace on a corpus of more than 4
427
+ billion tweets and evaluated it by implementing sentiment analysis on a Twitter
428
+ dataset containing more than 67 million tweets as an extrinsic task. For this
429
+ task, we compared the performance of two different classifiers of random forest
430
+ (RF) and linear support vector machine (SVM). For evaluation, we compared
431
+ emojiSpace performance with two other pre-trained embeddings and demonstrated
432
+ that emojiSpace outperforms both.
433
+
434
+ ---------------
435
+
436
+ ### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
437
+ *Grigori Fursin, Herve Guillou and Nicolas Essayan*
438
+
439
+ We present CodeReef - an open platform to share all the components necessary
440
+ to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML
441
+ models across diverse systems in the most efficient way. We also introduce the
442
+ CodeReef solution - a way to package and share models as non-virtualized,
443
+ portable, customizable and reproducible archive files. Such ML packages include
444
+ JSON meta description of models with all dependencies, Python APIs, CLI actions
445
+ and portable workflows necessary to automatically build, benchmark, test and
446
+ customize models across diverse platforms, AI frameworks, libraries, compilers
447
+ and datasets. We demonstrate several CodeReef solutions to automatically build,
448
+ run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO
449
+ dataset from the latest MLPerf inference benchmark across a wide range of
450
+ platforms from Raspberry Pi, Android phones and IoT devices to data centers.
451
+ Our long-term goal is to help researchers share their new techniques as
452
+ production-ready packages along with research papers to participate in
453
+ collaborative and reproducible benchmarking, compare the different
454
+ ML/software/hardware stacks and select the most efficient ones on a Pareto
455
+ frontier using online CodeReef dashboards.
456
+
457
+ ---------------
458
+
459
+ ### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
460
+ *Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov*
461
+
462
+ For decades, human-computer interaction has fundamentally been manual. Even
463
+ today, almost all productive work done on the computer necessitates human input
464
+ at every step. Autonomous virtual agents represent an exciting step in
465
+ automating many of these menial tasks. Virtual agents would empower users with
466
+ limited technical proficiency to harness the full possibilities of computer
467
+ systems. They could also enable the efficient streamlining of numerous computer
468
+ tasks, ranging from calendar management to complex travel bookings, with
469
+ minimal human intervention. In this paper, we introduce OmniACT, the
470
+ first-of-a-kind dataset and benchmark for assessing an agent's capability to
471
+ generate executable programs to accomplish computer tasks. Our scope extends
472
+ beyond traditional web automation, covering a diverse range of desktop
473
+ applications. The dataset consists of fundamental tasks such as "Play the next
474
+ song", as well as longer horizon tasks such as "Send an email to John Doe
475
+ mentioning the time and place to meet". Specifically, given a pair of screen
476
+ image and a visually-grounded natural language task, the goal is to generate a
477
+ script capable of fully executing the task. We run several strong baseline
478
+ language model agents on our benchmark. The strongest baseline, GPT-4, performs
479
+ the best on our benchmark. However, its performance level still reaches only 15%
480
+ of the human proficiency in generating executable scripts capable of completing
481
+ the task, demonstrating the challenge of our task for conventional web agents.
482
+ Our benchmark provides a platform to measure and evaluate the progress of
483
+ language model agents in automating computer tasks and motivates future work
484
+ towards building multimodal models that bridge large language models and the
485
+ visual grounding of computer screens.
486
+
487
+ ---------------
488
+
489
+ ### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
490
+ *Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and Yueqiang Dong*
491
+
492
+ Proactive human-robot interaction (HRI) allows the receptionist robots to
493
+ actively greet people and offer services based on vision, which has been found
494
+ to improve acceptability and customer satisfaction. Existing approaches are
495
+ either based on multi-stage decision processes or based on end-to-end decision
496
+ models. However, the rule-based approaches require sedulous expert efforts and
497
+ only handle minimal pre-defined scenarios. On the other hand, existing works
498
+ with end-to-end models are limited to very general greetings or few behavior
499
+ patterns (typically less than 10). To address those challenges, we propose a
500
+ new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot
501
+ Interaction (TFVT-HRI). The proposed framework extracts visual tokens of
502
+ relative objects from an RGB camera first. To ensure the correct interpretation
503
+ of the scenario, a transformer decision model is then employed to process the
504
+ visual tokens, which is augmented with the temporal and spatial information. It
505
+ predicts the appropriate action to take in each scenario and identifies the
506
+ right target. Our data is collected from an in-service receptionist robot in an
507
+ office building, which is then annotated by experts for appropriate proactive
508
+ behavior. The action set includes 1000+ diverse patterns by combining language,
509
+ emoji expression, and body motions. We compare our model with other SOTA
510
+ end-to-end models on both offline test sets and online user experiments in
511
+ realistic office building environments to validate this framework. It is
512
+ demonstrated that the decision model achieves SOTA performance in action
513
+ triggering and selection, resulting in more humanness and intelligence when
514
+ compared with the previous reactive reception policies.
515
+
516
+ ---------------
517
+
518
+ ### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
519
+ *Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa*
520
+
521
+ This article presents the design and the implementation of a cloud system for
522
+ knowledge-based autonomous interaction devised for Social Robots and other
523
+ conversational agents. The system is particularly convenient for low-cost
524
+ robots and devices: it can be used as a stand-alone dialogue system or as an
525
+ integration to provide "background" dialogue capabilities to any preexisting
526
+ Natural Language Processing ability that the robot may already have as part of
527
+ its basic skills. By connecting to the cloud, developers are provided with a
528
+ sustainable solution to manage verbal interaction through a network connection,
529
+ with about 3,000 topics of conversation ready for "chit-chatting" and a library
530
+ of pre-cooked plans that only needs to be grounded into the robot's physical
531
+ capabilities. The system is structured as a set of REST API endpoints so that
532
+ it can be easily expanded by adding new APIs to improve the capabilities of the
533
+ clients connected to the cloud. Another key feature of the system is that it
534
+ has been designed to make the development of its clients straightforward: in
535
+ this way, multiple robots and devices can be easily endowed with the capability
536
+ of autonomously interacting with the user, understanding when to perform
537
+ specific actions, and exploiting all the information provided by cloud
538
+ services. The article outlines and discusses the results of the experiments
539
+ performed to assess the system's performance in terms of response time, paving
540
+ the way for its use both for research and market solutions. Links to
541
+ repositories with clients for ROS and popular robots such as Pepper and NAO are
542
+ available on request.
543
+
544
+ ---------------
+ """)
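The committed app.py names an interaction protocol, decentralized decision making, and an agent/environment/ruleset split as the core pieces of a multi-agent system, but the file only renders those bullets as text. The short Python sketch below is illustrative only and is not part of the commit; every class, function, and name in it (Message, Agent, Environment, decide, step) is hypothetical. It shows one minimal way the three bullets could fit together: agents decide independently and communicate only through a shared environment that enforces the protocol.

import random
from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str
    content: str

@dataclass
class Agent:
    name: str
    inbox: list = field(default_factory=list)

    def decide(self) -> str:
        # Decentralized decision making: each agent acts only on its own local state.
        if self.inbox:
            return f"ack:{self.inbox[-1].content}"
        return random.choice(["propose:explore", "propose:wait"])

class Environment:
    # Shared space where agents interact; send/step is the (trivial) ruleset.
    def __init__(self, agents):
        self.agents = {a.name: a for a in agents}

    def send(self, msg, recipient):
        # Interaction protocol: all communication is routed through the environment.
        self.agents[recipient].inbox.append(msg)

    def step(self):
        for agent in list(self.agents.values()):
            action = agent.decide()
            for other in self.agents:
                if other != agent.name:
                    self.send(Message(agent.name, action), other)

if __name__ == "__main__":
    env = Environment([Agent("alpha"), Agent("beta")])
    for _ in range(3):
        env.step()
    for name, agent in env.agents.items():
        print(name, [m.content for m in agent.inbox])

In a fuller system, the environment's ruleset would also arbitrate the "collaboration and competition" bullet, for example by scoring joint versus conflicting proposals before delivering messages.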