awacke1 committed on
Commit 8d8a8b9 · verified · 1 Parent(s): 9e832a1

Create app.py

Files changed (1)
  1. app.py +744 -0
app.py ADDED
@@ -0,0 +1,744 @@
1
+ import streamlit as st
2
+
3
+ st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")
4
+
5
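+ # Inject a small CSS snippet to hide Streamlit's default hamburger menu and footer.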
+ hide_streamlit_style = """
6
+ <style>
7
+ #MainMenu {visibility: hidden;}
8
+ footer {visibility: hidden;}
9
+ </style>
10
+ """
11
+ st.markdown(hide_streamlit_style, unsafe_allow_html=True)
12
+
13
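+ # Lay the overview out in two side-by-side columns.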
+ col1, col2 = st.columns(2)
14
+
15
+ with col1:
16
+ st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:")
17
+ st.markdown("### **Key Aspects** :bulb:")
18
+ st.markdown("""
19
+ 1. **Interaction Protocol** 🤝 \n
20
+ - Define rules for communication and cooperation \n
21
+ 2. **Decentralized Decision Making** 🎯 \n
22
+ - Autonomous agents make independent decisions \n
23
+ 3. **Collaboration and Competition** 🤼 \n
24
+ - Agents work together or against each other \n
25
+ """)
26
+
27
+ with col2:
28
+ st.markdown("### **Entities** :guardsman:")
29
+ st.markdown("""
30
+ 1. **Autonomous Agents** 🤖 \n
31
+ - Independent entities with decision-making capabilities \n
32
+ 2. **Environment** 🌐 \n
33
+ - Shared space where agents interact \n
34
+ 3. **Ruleset** 📜 \n
35
+ - Defines interaction protocol and decision-making processes \n
36
+ """)
37
+
38
+ st.markdown("---")
39
+
40
+ st.markdown("## **Interaction Protocol** 🤝 :bulb:")
41
+ st.markdown("### **Key Elements** :guardsman:")
42
+ st.markdown("""
43
+ 1. **Communication** 🗣 \n
44
+ - Agents exchange information \n
45
+ 2. **Cooperation** 🤝 \n
46
+ # 🩺🔍 Search Results
47
+ ### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
48
+ *Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang*
49
+
50
+ In this study, our goal is to create interactive avatar agents that can
51
+ autonomously plan and animate nuanced facial movements realistically, from both
52
+ visual and behavioral perspectives. Given high-level inputs about the
53
+ environment and agent profile, our framework harnesses LLMs to produce a series
54
+ of detailed text descriptions of the avatar agents' facial motions. These
55
+ descriptions are then processed by our task-agnostic driving engine into motion
56
+ token sequences, which are subsequently converted into continuous motion
57
+ embeddings that are further consumed by our standalone neural-based renderer to
58
+ generate the final photorealistic avatar animations. These streamlined
59
+ processes allow our framework to adapt to a variety of non-verbal avatar
60
+ interactions, both monadic and dyadic. Our extensive study, which includes
61
+ experiments on both newly compiled and existing datasets featuring two types of
62
+ agents -- one capable of monadic interaction with the environment, and the
63
+ other designed for dyadic conversation -- validates the effectiveness and
64
+ versatility of our approach. To our knowledge, we advanced a leap step by
65
+ combining LLMs and neural rendering for generalized non-verbal prediction and
66
+ photo-realistic rendering of avatar agents.
67
+
68
+ ---------------
69
+
70
+ ### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
71
+ *Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*
72
+
73
+ Controllable image captioning is an emerging multimodal topic that aims to
74
+ describe the image with natural language following human purpose,
75
+ $\textit{e.g.}$, looking at the specified regions or telling in a particular
76
+ text style. State-of-the-art methods are trained on annotated pairs of input
77
+ controls and output captions. However, the scarcity of such well-annotated
78
+ multimodal data largely limits their usability and scalability for interactive
79
+ AI systems. Leveraging unimodal instruction-following foundation models is a
80
+ promising alternative that benefits from broader sources of data. In this
81
+ paper, we present Caption AnyThing (CAT), a foundation model augmented image
82
+ captioning framework supporting a wide range of multimodel controls: 1) visual
83
+ controls, including points, boxes, and trajectories; 2) language controls, such
84
+ as sentiment, length, language, and factuality. Powered by Segment Anything
85
+ Model (SAM) and ChatGPT, we unify the visual and language prompts into a
86
+ modularized framework, enabling the flexible combination between different
87
+ controls. Extensive case studies demonstrate the user intention alignment
88
+ capabilities of our framework, shedding light on effective user interaction
89
+ modeling in vision-language applications. Our code is publicly available at
90
+ https://github.com/ttengwang/Caption-Anything.
91
+
92
+ ---------------
93
+
94
+ ### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
95
+ *Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei*
96
+
97
+ We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
98
+ capabilities of perceiving object descriptions (e.g., bounding boxes) and
99
+ grounding text to the visual world. Specifically, we represent refer
100
+ expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where
101
+ object descriptions are sequences of location tokens. Together with multimodal
102
+ corpora, we construct large-scale data of grounded image-text pairs (called
103
+ GrIT) to train the model. In addition to the existing capabilities of MLLMs
104
+ (e.g., perceiving general modalities, following instructions, and performing
105
+ in-context learning), Kosmos-2 integrates the grounding capability into
106
+ downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
107
+ including (i) multimodal grounding, such as referring expression comprehension,
108
+ and phrase grounding, (ii) multimodal referring, such as referring expression
109
+ generation, (iii) perception-language tasks, and (iv) language understanding
110
+ and generation. This work lays out the foundation for the development of
111
+ Embodiment AI and sheds light on the big convergence of language, multimodal
112
+ perception, action, and world modeling, which is a key step toward artificial
113
+ general intelligence. Code and pretrained models are available at
114
+ https://aka.ms/kosmos-2.
115
+
116
+ ---------------
117
+
118
+ ### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
119
+ *Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
120
+
121
+ Screen user interfaces (UIs) and infographics, sharing similar visual
122
+ language and design principles, play important roles in human communication and
123
+ human-machine interaction. We introduce ScreenAI, a vision-language model that
124
+ specializes in UI and infographics understanding. Our model improves upon the
125
+ PaLI architecture with the flexible patching strategy of pix2struct and is
126
+ trained on a unique mixture of datasets. At the heart of this mixture is a
127
+ novel screen annotation task in which the model has to identify the type and
128
+ location of UI elements. We use these text annotations to describe screens to
129
+ Large Language Models and automatically generate question-answering (QA), UI
130
+ navigation, and summarization training datasets at scale. We run ablation
131
+ studies to demonstrate the impact of these design choices. At only 5B
132
+ parameters, ScreenAI achieves new state-of-the-art results on UI- and
133
+ infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
134
+ Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
135
+ InfographicVQA) compared to models of similar size. Finally, we release three
136
+ new datasets: one focused on the screen annotation task and two others focused
137
+ on question answering.
138
+
139
+ ---------------
140
+
141
+ ### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
142
+ *Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu*
143
+
144
+ Task-oriented conversational agents rely on semantic parsers to translate
145
+ natural language to formal representations. In this paper, we propose the
146
+ design and rationale of the ThingTalk formal representation, and how the design
147
+ improves the development of transactional task-oriented agents.
148
+ ThingTalk is built on four core principles: (1) representing user requests
149
+ directly as executable statements, covering all the functionality of the agent,
150
+ (2) representing dialogues formally and succinctly to support accurate
151
+ contextual semantic parsing, (3) standardizing types and interfaces to maximize
152
+ reuse between agents, and (4) allowing multiple, independently-developed agents
153
+ to be composed in a single virtual assistant. ThingTalk is developed as part of
154
+ the Genie Framework that allows developers to quickly build transactional
155
+ agents given a database and APIs.
156
+ We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
157
+ Compared to the others, the ThingTalk design is both more general and more
158
+ cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
159
+ associated tools yields a new state of the art accuracy of 79% turn-by-turn.
160
+
161
+ ---------------
162
+
163
+ ### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
164
+ *Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould*
165
+
166
+ In the pursuit of efficient automated content creation, procedural
167
+ generation, leveraging modifiable parameters and rule-based systems, emerges as
168
+ a promising approach. Nonetheless, it could be a demanding endeavor, given its
169
+ intricate nature necessitating a deep understanding of rules, algorithms, and
170
+ parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
171
+ large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT
172
+ positions LLMs as proficient problem solvers, dissecting the procedural 3D
173
+ modeling tasks into accessible segments and appointing the apt agent for each
174
+ task. 3D-GPT integrates three core agents: the task dispatch agent, the
175
+ conceptualization agent, and the modeling agent. They collaboratively achieve
176
+ two objectives. First, it enhances concise initial scene descriptions, evolving
177
+ them into detailed forms while dynamically adapting the text based on
178
+ subsequent instructions. Second, it integrates procedural generation,
179
+ extracting parameter values from enriched text to effortlessly interface with
180
+ 3D software for asset creation. Our empirical investigations confirm that
181
+ 3D-GPT not only interprets and executes instructions, delivering reliable
182
+ results but also collaborates effectively with human designers. Furthermore, it
183
+ seamlessly integrates with Blender, unlocking expanded manipulation
184
+ possibilities. Our work highlights the potential of LLMs in 3D modeling,
185
+ offering a basic framework for future advancements in scene generation and
186
+ animation.
187
+
188
+ ---------------
189
+
190
+ ### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
191
+ *Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan*
192
+
193
+ Equipping embodied agents with commonsense is important for robots to
194
+ successfully complete complex human instructions in general environments.
195
+ Recent large language models (LLM) can embed rich semantic knowledge for agents
196
+ in plan generation of complex tasks, while they lack the information about the
197
+ realistic world and usually yield infeasible action sequences. In this paper,
198
+ we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
199
+ with physical scene constraint, where the agent generates executable plans
200
+ according to the existed objects in the scene by aligning LLMs with the visual
201
+ perception models. Specifically, we first construct a multimodal dataset
202
+ containing triplets of indoor scenes, instructions and action plans, where we
203
+ provide the designed prompts and the list of existing objects in the scene for
204
+ GPT-3.5 to generate a large number of instructions and corresponding planned
205
+ actions. The generated data is leveraged for grounded plan tuning of
206
+ pre-trained LLMs. During inference, we discover the objects in the scene by
207
+ extending open-vocabulary object detectors to multi-view RGB images collected
208
+ in different achievable locations. Experimental results show that the generated
209
+ plan from our TaPA framework can achieve higher success rate than LLaVA and
210
+ GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
211
+ planning in general and complex environments.
212
+
213
+ ---------------
214
+
215
+ ### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
216
+ *Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*
217
+
218
+ Recent advancements in vision-language pre-training (e.g. CLIP) have shown
219
+ that vision models can benefit from language supervision. While many models
220
+ using language modality have achieved great success on 2D vision tasks, the
221
+ joint representation learning of 3D point cloud with text remains
222
+ under-explored due to the difficulty of 3D-Text data pair acquisition and the
223
+ irregularity of 3D data structure. In this paper, we propose a novel Text4Point
224
+ framework to construct language-guided 3D point cloud models. The key idea is
225
+ utilizing 2D images as a bridge to connect the point cloud and the language
226
+ modalities. The proposed Text4Point follows the pre-training and fine-tuning
227
+ paradigm. During the pre-training stage, we establish the correspondence of
228
+ images and point clouds based on the readily available RGB-D data and use
229
+ contrastive learning to align the image and point cloud representations.
230
+ Together with the well-aligned image and text features achieved by CLIP, the
231
+ point cloud features are implicitly aligned with the text embeddings. Further,
232
+ we propose a Text Querying Module to integrate language information into 3D
233
+ representation learning by querying text embeddings with point cloud features.
234
+ For fine-tuning, the model learns task-specific 3D representations under
235
+ informative language guidance from the label set without 2D images. Extensive
236
+ experiments demonstrate that our model shows consistent improvement on various
237
+ downstream tasks, such as point cloud semantic segmentation, instance
238
+ segmentation, and object detection. The code will be available here:
239
+ https://github.com/LeapLabTHU/Text4Point
240
+
241
+ ---------------
242
+
243
+ ### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
244
+ *Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji*
245
+
246
+ Large Language Model (LLM) agents, capable of performing a broad range of
247
+ actions, such as invoking tools and controlling robots, show great potential in
248
+ tackling real-world challenges. LLM agents are typically prompted to produce
249
+ actions by generating JSON or text in a pre-defined format, which is usually
250
+ limited by constrained action space (e.g., the scope of pre-defined tools) and
251
+ restricted flexibility (e.g., inability to compose multiple tools). This work
252
+ proposes to use executable Python code to consolidate LLM agents' actions into
253
+ a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
254
+ can execute code actions and dynamically revise prior actions or emit new
255
+ actions upon new observations through multi-turn interactions. Our extensive
256
+ analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
257
+ CodeAct outperforms widely used alternatives (up to 20% higher success rate).
258
+ The encouraging performance of CodeAct motivates us to build an open-source LLM
259
+ agent that interacts with environments by executing interpretable code and
260
+ collaborates with users using natural language. To this end, we collect an
261
+ instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
262
+ interactions using CodeAct. We show that it can be used with existing data to
263
+ improve models in agent-oriented tasks without compromising their general
264
+ capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
265
+ Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
266
+ model training) using existing libraries and autonomously self-debug.
267
+
268
+ ---------------
269
+
270
+ ### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
271
+ *Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried*
272
+
273
+ Autonomous agents capable of planning, reasoning, and executing actions on
274
+ the web offer a promising avenue for automating computer tasks. However, the
275
+ majority of existing benchmarks primarily focus on text-based agents,
276
+ neglecting many natural tasks that require visual information to effectively
277
+ solve. Given that most computer interfaces cater to human perception, visual
278
+ information often augments textual data in ways that text-only models struggle
279
+ to harness effectively. To bridge this gap, we introduce VisualWebArena, a
280
+ benchmark designed to assess the performance of multimodal web agents on
281
+ realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set
282
+ of diverse and complex web-based tasks that evaluate various capabilities of
283
+ autonomous multimodal agents. To perform on this benchmark, agents need to
284
+ accurately process image-text inputs, interpret natural language instructions,
285
+ and execute actions on websites to accomplish user-defined objectives. We
286
+ conduct an extensive evaluation of state-of-the-art LLM-based autonomous
287
+ agents, including several multimodal models. Through extensive quantitative and
288
+ qualitative analysis, we identify several limitations of text-only LLM agents,
289
+ and reveal gaps in the capabilities of state-of-the-art multimodal language
290
+ agents. VisualWebArena provides a framework for evaluating multimodal
291
+ autonomous language agents, and offers insights towards building stronger
292
+ autonomous agents for the web. Our code, baseline models, and data is publicly
293
+ available at https://jykoh.com/vwa.
294
+
295
+ ---------------
296
+
297
+ ### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
298
+ *Seungwhan Moon, Leonardo Neves, Vitor Carvalho*
299
+
300
+ We introduce a new task called Multimodal Named Entity Recognition (MNER) for
301
+ noisy user-generated data such as tweets or Snapchat captions, which comprise
302
+ short text with accompanying images. These social media posts often come in
303
+ inconsistent or incomplete syntax and lexical notations with very limited
304
+ surrounding textual contexts, bringing significant challenges for NER. To this
305
+ end, we create a new dataset for MNER called SnapCaptions (Snapchat
306
+ image-caption pairs submitted to public and crowd-sourced stories with fully
307
+ annotated named entities). We then build upon the state-of-the-art Bi-LSTM
308
+ word/character based NER models with 1) a deep image network which incorporates
309
+ relevant visual context to augment textual information, and 2) a generic
310
+ modality-attention module which learns to attenuate irrelevant modalities while
311
+ amplifying the most informative ones to extract contexts from, adaptive to each
312
+ sample and token. The proposed MNER model with modality attention significantly
313
+ outperforms the state-of-the-art text-only NER models by successfully
314
+ leveraging provided visual contexts, opening up potential applications of MNER
315
+ on myriads of social media platforms.
316
+
317
+ ---------------
318
+
319
+ ### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
320
+ *Zhuosheng Zhang, Aston Zhang*
321
+
322
+ Autonomous user interface (UI) agents aim to facilitate task automation by
323
+ interacting with the user interface without manual intervention. Recent studies
324
+ have investigated eliciting the capabilities of large language models (LLMs)
325
+ for effective engagement in diverse environments. To align with the
326
+ input-output requirement of LLMs, existing approaches are developed under a
327
+ sandbox setting where they rely on external tools and application-specific APIs
328
+ to parse the environment into textual elements and interpret the predicted
329
+ actions. Consequently, those approaches often grapple with inference
330
+ inefficiency and error propagation risks. To mitigate the challenges, we
331
+ introduce Auto-UI, a multimodal solution that directly interacts with the
332
+ interface, bypassing the need for environment parsing or reliance on
333
+ application-dependent APIs. Moreover, we propose a chain-of-action technique --
334
+ leveraging a series of intermediate previous action histories and future action
335
+ plans -- to help the agent decide what action to execute. We evaluate our
336
+ approach on a new device-control benchmark AITW with 30K unique instructions,
337
+ spanning multi-step tasks such as application operation, web searching, and web
338
+ shopping. Experimental results show that Auto-UI achieves state-of-the-art
339
+ performance with an action type prediction accuracy of 90% and an overall
340
+ action success rate of 74%. Code is publicly available at
341
+ https://github.com/cooelf/Auto-UI.
342
+
343
+ ---------------
344
+
345
+ ### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
346
+ *Victor Dibia*
347
+
348
+ Systems that support users in the automatic creation of visualizations must
349
+ address several subtasks - understand the semantics of data, enumerate relevant
350
+ visualization goals and generate visualization specifications. In this work, we
351
+ pose visualization generation as a multi-stage generation problem and argue
352
+ that well-orchestrated pipelines based on large language models (LLMs) such as
353
+ ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing
354
+ these tasks. We present LIDA, a novel tool for generating grammar-agnostic
355
+ visualizations and infographics. LIDA comprises of 4 modules - A SUMMARIZER
356
+ that converts data into a rich but compact natural language summary, a GOAL
357
+ EXPLORER that enumerates visualization goals given the data, a VISGENERATOR
358
+ that generates, refines, executes and filters visualization code and an
359
+ INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA
360
+ provides a python api, and a hybrid user interface (direct manipulation and
361
+ multilingual natural language) for interactive chart, infographics and data
362
+ story generation. Learn more about the project here -
363
+ https://microsoft.github.io/lida/
364
+
365
+ ---------------
366
+
367
+ ### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
368
+ *Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le*
369
+
370
+ Video paragraph captioning aims to generate a multi-sentence description of
371
+ an untrimmed video with several temporal event locations in coherent
372
+ storytelling. Following the human perception process, where the scene is
373
+ effectively understood by decomposing it into visual (e.g. human, animal) and
374
+ non-visual components (e.g. action, relations) under the mutual influence of
375
+ vision and language, we first propose a visual-linguistic (VL) feature. In the
376
+ proposed VL feature, the scene is modeled by three modalities including (i) a
377
+ global visual environment; (ii) local visual main agents; (iii) linguistic
378
+ scene elements. We then introduce an autoregressive Transformer-in-Transformer
379
+ (TinT) to simultaneously capture the semantic coherence of intra- and
380
+ inter-event contents within a video. Finally, we present a new VL contrastive
381
+ loss function to guarantee learnt embedding features are matched with the
382
+ captions semantics. Comprehensive experiments and extensive ablation studies on
383
+ ActivityNet Captions and YouCookII datasets show that the proposed
384
+ Visual-Linguistic Transformer-in-Transform (VLTinT) outperforms prior
385
+ state-of-the-art methods on accuracy and diversity. Source code is made
386
+ publicly available at: https://github.com/UARK-AICV/VLTinT.
387
+
388
+ ---------------
389
+
390
+ ### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
391
+ *Samuel Mascarenhas, Manuel Guimarães, Pedro A. Santos, João Dias, Rui Prada, Ana Paiva*
392
+
393
+ More than a decade has passed since the development of FearNot!, an
394
+ application designed to help children deal with bullying through role-playing
395
+ with virtual characters. It was also the application that led to the creation
396
+ of FAtiMA, an affective agent architecture for creating autonomous characters
397
+ that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a
398
+ collection of open-source tools that is designed to help researchers, game
399
+ developers and roboticists incorporate a computational model of emotion and
400
+ decision-making in their work. The toolkit was developed with the goal of
401
+ making FAtiMA more accessible, easier to incorporate into different projects
402
+ and more flexible in its capabilities for human-agent interaction, based upon
403
+ the experience gathered over the years across different virtual environments
404
+ and human-robot interaction scenarios. As a result, this work makes several
405
+ different contributions to the field of Agent-Based Architectures. More
406
+ precisely, FAtiMA Toolkit's library based design allows developers to easily
407
+ integrate it with other frameworks, its meta-cognitive model affords different
408
+ internal reasoners and affective components and its explicit dialogue structure
409
+ gives control to the author even within highly complex scenarios. To
410
+ demonstrate the use of FAtiMA Toolkit, several different use cases where the
411
+ toolkit was successfully applied are described and discussed.
412
+
413
+ ---------------
414
+
415
+ ### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
416
+ *Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri*
417
+
418
+ In the absence of nonverbal cues during messaging communication, users
419
+ express part of their emotions using emojis. Thus, having emojis in the
420
+ vocabulary of text messaging language models can significantly improve many
421
+ natural language processing (NLP) applications such as online communication
422
+ analysis. On the other hand, word embedding models are usually trained on a
423
+ very large corpus of text such as Wikipedia or Google News datasets that
424
+ include very few samples with emojis. In this study, we create emojiSpace,
425
+ which is a combined word-emoji embedding using the word2vec model from the
426
+ Genism library in Python. We trained emojiSpace on a corpus of more than 4
427
+ billion tweets and evaluated it by implementing sentiment analysis on a Twitter
428
+ dataset containing more than 67 million tweets as an extrinsic task. For this
429
+ task, we compared the performance of two different classifiers of random forest
430
+ (RF) and linear support vector machine (SVM). For evaluation, we compared
431
+ emojiSpace performance with two other pre-trained embeddings and demonstrated
432
+ that emojiSpace outperforms both.
433
+
434
+ ---------------
435
+
436
+ ### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
437
+ *Grigori Fursin, Herve Guillou and Nicolas Essayan*
438
+
439
+ We present CodeReef - an open platform to share all the components necessary
440
+ to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML
441
+ models across diverse systems in the most efficient way. We also introduce the
442
+ CodeReef solution - a way to package and share models as non-virtualized,
443
+ portable, customizable and reproducible archive files. Such ML packages include
444
+ JSON meta description of models with all dependencies, Python APIs, CLI actions
445
+ and portable workflows necessary to automatically build, benchmark, test and
446
+ customize models across diverse platforms, AI frameworks, libraries, compilers
447
+ and datasets. We demonstrate several CodeReef solutions to automatically build,
448
+ run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO
449
+ dataset from the latest MLPerf inference benchmark across a wide range of
450
+ platforms from Raspberry Pi, Android phones and IoT devices to data centers.
451
+ Our long-term goal is to help researchers share their new techniques as
452
+ production-ready packages along with research papers to participate in
453
+ collaborative and reproducible benchmarking, compare the different
454
+ ML/software/hardware stacks and select the most efficient ones on a Pareto
455
+ frontier using online CodeReef dashboards.
456
+
457
+ ---------------
458
+
459
+ ### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
460
+ *Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov*
461
+
462
+ For decades, human-computer interaction has fundamentally been manual. Even
463
+ today, almost all productive work done on the computer necessitates human input
464
+ at every step. Autonomous virtual agents represent an exciting step in
465
+ automating many of these menial tasks. Virtual agents would empower users with
466
+ limited technical proficiency to harness the full possibilities of computer
467
+ systems. They could also enable the efficient streamlining of numerous computer
468
+ tasks, ranging from calendar management to complex travel bookings, with
469
+ minimal human intervention. In this paper, we introduce OmniACT, the
470
+ first-of-a-kind dataset and benchmark for assessing an agent's capability to
471
+ generate executable programs to accomplish computer tasks. Our scope extends
472
+ beyond traditional web automation, covering a diverse range of desktop
473
+ applications. The dataset consists of fundamental tasks such as "Play the next
474
+ song", as well as longer horizon tasks such as "Send an email to John Doe
475
+ mentioning the time and place to meet". Specifically, given a pair of screen
476
+ image and a visually-grounded natural language task, the goal is to generate a
477
+ script capable of fully executing the task. We run several strong baseline
478
+ language model agents on our benchmark. The strongest baseline, GPT-4, performs
479
+ the best on our benchmark. However, its performance level still reaches only 15%
480
+ of the human proficiency in generating executable scripts capable of completing
481
+ the task, demonstrating the challenge of our task for conventional web agents.
482
+ Our benchmark provides a platform to measure and evaluate the progress of
483
+ language model agents in automating computer tasks and motivates future work
484
+ towards building multimodal models that bridge large language models and the
485
+ visual grounding of computer screens.
486
+
487
+ ---------------
488
+
489
+ ### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
490
+ *Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and Yueqiang Dong*
491
+
492
+ Proactive human-robot interaction (HRI) allows the receptionist robots to
493
+ actively greet people and offer services based on vision, which has been found
494
+ to improve acceptability and customer satisfaction. Existing approaches are
495
+ either based on multi-stage decision processes or based on end-to-end decision
496
+ models. However, the rule-based approaches require sedulous expert efforts and
497
+ only handle minimal pre-defined scenarios. On the other hand, existing works
498
+ with end-to-end models are limited to very general greetings or few behavior
499
+ patterns (typically less than 10). To address those challenges, we propose a
500
+ new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot
501
+ Interaction (TFVT-HRI). The proposed framework extracts visual tokens of
502
+ relative objects from an RGB camera first. To ensure the correct interpretation
503
+ of the scenario, a transformer decision model is then employed to process the
504
+ visual tokens, which is augmented with the temporal and spatial information. It
505
+ predicts the appropriate action to take in each scenario and identifies the
506
+ right target. Our data is collected from an in-service receptionist robot in an
507
+ office building, which is then annotated by experts for appropriate proactive
508
+ behavior. The action set includes 1000+ diverse patterns by combining language,
509
+ emoji expression, and body motions. We compare our model with other SOTA
510
+ end-to-end models on both offline test sets and online user experiments in
511
+ realistic office building environments to validate this framework. It is
512
+ demonstrated that the decision model achieves SOTA performance in action
513
+ triggering and selection, resulting in more humanness and intelligence when
514
+ compared with the previous reactive reception policies.
515
+
516
+ ---------------
517
+
518
+ ### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
519
+ *Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa*
520
+
521
+ This article presents the design and the implementation of a cloud system for
522
+ knowledge-based autonomous interaction devised for Social Robots and other
523
+ conversational agents. The system is particularly convenient for low-cost
524
+ robots and devices: it can be used as a stand-alone dialogue system or as an
525
+ integration to provide "background" dialogue capabilities to any preexisting
526
+ Natural Language Processing ability that the robot may already have as part of
527
+ its basic skills. By connecting to the cloud, developers are provided with a
528
+ sustainable solution to manage verbal interaction through a network connection,
529
+ with about 3,000 topics of conversation ready for "chit-chatting" and a library
530
+ of pre-cooked plans that only needs to be grounded into the robot's physical
531
+ capabilities. The system is structured as a set of REST API endpoints so that
532
+ it can be easily expanded by adding new APIs to improve the capabilities of the
533
+ clients connected to the cloud. Another key feature of the system is that it
534
+ has been designed to make the development of its clients straightforward: in
535
+ this way, multiple robots and devices can be easily endowed with the capability
536
+ of autonomously interacting with the user, understanding when to perform
537
+ specific actions, and exploiting all the information provided by cloud
538
+ services. The article outlines and discusses the results of the experiments
539
+ performed to assess the system's performance in terms of response time, paving
540
+ the way for its use both for research and market solutions. Links to
541
+ repositories with clients for ROS and popular robots such as Pepper and NAO are
542
+ available on request.
543
+
544
+ ---------------
+ """)
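The committed app.py names an interaction protocol, decentralized decision making, and an agent/environment/ruleset split as the core pieces of a multi-agent system, but the file only renders those bullets as text. The short Python sketch below is illustrative only and is not part of the commit; every class, function, and name in it (Message, Agent, Environment, decide, step) is hypothetical. It shows one minimal way the three bullets could fit together: agents decide independently and communicate only through a shared environment that enforces the protocol.

import random
from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str
    content: str

@dataclass
class Agent:
    name: str
    inbox: list = field(default_factory=list)

    def decide(self) -> str:
        # Decentralized decision making: each agent acts only on its own local state.
        if self.inbox:
            return f"ack:{self.inbox[-1].content}"
        return random.choice(["propose:explore", "propose:wait"])

class Environment:
    # Shared space where agents interact; send/step is the (trivial) ruleset.
    def __init__(self, agents):
        self.agents = {a.name: a for a in agents}

    def send(self, msg, recipient):
        # Interaction protocol: all communication is routed through the environment.
        self.agents[recipient].inbox.append(msg)

    def step(self):
        for agent in list(self.agents.values()):
            action = agent.decide()
            for other in self.agents:
                if other != agent.name:
                    self.send(Message(agent.name, action), other)

if __name__ == "__main__":
    env = Environment([Agent("alpha"), Agent("beta")])
    for _ in range(3):
        env.step()
    for name, agent in env.agents.items():
        print(name, [m.content for m in agent.inbox])

In a fuller system, the environment's ruleset would also arbitrate the "collaboration and competition" bullet, for example by scoring joint versus conflicting proposals before delivering messages.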