Spaces:
Running
Running
A newer version of the Gradio SDK is available:
5.19.0
metadata
title: Talk to Smolagents
emoji: 💻
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
license: mit
short_description: FastRTC Voice Agent with smolagents
tags:
- webrtc
- websocket
- gradio
- secret|HF_TOKEN
Voice LLM Agent with Image Generation
A voice-enabled AI assistant powered by FastRTC that can:
- Stream audio in real-time using WebRTC
- Listen and respond with natural pauses in conversation
- Generate images based on your requests
- Maintain conversation context across exchanges
This app combines the real-time communication capabilities of FastRTC with the powerful agent framework of smolagents.
Key Features
- Real-time Streaming: Uses FastRTC's WebRTC-based audio streaming
- Voice Activation: Automatic detection of speech pauses to trigger responses
- Multi-modal Interaction: Combines voice and image generation in a single interface
Setup
Install Python 3.9+ and create a virtual environment:
python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate
Install dependencies:
pip install -r requirements.txt
Create a
.env
file with the following:HF_TOKEN=your_huggingface_api_key MODE=UI # Use 'UI' for Gradio interface, leave blank for HTML interface
Running the App
With Gradio UI (Recommended)
MODE=UI python app.py
This launches a Gradio UI at http://localhost:7860 with:
- FastRTC's built-in streaming audio components
- A chat interface showing the conversation
- An image display panel for generated images
How to Use
- Click the microphone button to start streaming your voice.
- Speak naturally - the app will automatically detect when you pause.
- Ask the agent to generate an image, for example:
- "Create an image of a magical forest with glowing mushrooms."
- "Generate a picture of a futuristic city with flying cars."
- View the generated image and hear the agent's response.
Technical Architecture
FastRTC Components
- Stream: Core component that handles WebRTC connections and audio streaming
- ReplyOnPause: Detects when the user stops speaking to trigger a response
- get_stt_model/get_tts_model: Provides optimized speech-to-text and text-to-speech models
smolagents Components
- CodeAgent: Intelligent agent that can use tools based on natural language inputs
- Tool.from_space: Integration with Hugging Face Spaces for image generation
- HfApiModel: Connection to powerful language models for understanding requests
Integration Flow
- FastRTC streams and processes audio input in real-time
- Speech is converted to text and passed to the smolagents CodeAgent
- The agent processes the request and calls tools when needed
- Responses and generated images are streamed back through FastRTC
- The UI updates to show both text responses and generated images
Advanced Features
- Conversation history is maintained across exchanges
- Error handling ensures the app continues working even if agent processing fails
- The application leverages FastRTC's streaming capabilities for efficient audio transmission