burtenshaw committed · Commit 0fdf3b6 · 0 parent(s)
initial commit

Browse files:
- .gitignore       +5   -0
- README.md        +108 -0
- app.py           +92  -0
- pyproject.toml   +11  -0
- requirements.txt +125 -0
.gitignore
ADDED
@@ -0,0 +1,5 @@
+.env
+.venv
+.python-version
+.vscode
+uv.lock
README.md
ADDED
@@ -0,0 +1,108 @@
+---
+title: Talk to Smolagents
+emoji: 💻
+colorFrom: purple
+colorTo: red
+sdk: gradio
+sdk_version: 5.16.0
+app_file: app.py
+pinned: false
+license: mit
+short_description: FastRTC Voice Agent with smolagents
+tags: [webrtc, websocket, gradio, secret|HF_TOKEN]
+---
+
+# Voice LLM Agent with Image Generation
+
+A voice-enabled AI assistant powered by FastRTC that can:
+1. Stream audio in real-time using WebRTC
+2. Listen and respond with natural pauses in conversation
+3. Generate images based on your requests
+4. Maintain conversation context across exchanges
+
+This app combines the real-time communication capabilities of FastRTC with the powerful agent framework of smolagents.
+
+## Key Features
+
+- **Real-time Streaming**: Uses FastRTC's WebRTC-based audio streaming
+- **Voice Activation**: Automatic detection of speech pauses to trigger responses
+- **Multi-modal Interaction**: Combines voice and image generation in a single interface
+
+## Setup
+
+1. Install Python 3.10+ and create a virtual environment:
+```bash
+python -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+```
+
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+3. Create a `.env` file with the following:
+```
+HF_TOKEN=your_huggingface_api_key
+MODE=UI  # Use 'UI' for the Gradio interface; leave blank for the HTML interface
+```
+
+## Running the App
+
+### With Gradio UI (Recommended)
+
+```bash
+MODE=UI python app.py
+```
+
+This launches a Gradio UI at http://localhost:7860 with:
+- FastRTC's built-in streaming audio components
+- A chat interface showing the conversation
+- An image display panel for generated images
+
+### With FastAPI (Advanced)
+
+```bash
+python app.py
+```
+
+This serves a FastAPI app at http://localhost:7860 that provides:
+- WebRTC-based audio communication
+- The same smolagents functionality in a more flexible API
+
+## How to Use
+
+1. Click the microphone button to start streaming your voice.
+2. Speak naturally - the app will automatically detect when you pause.
+3. Ask the agent to generate an image, for example:
+   - "Create an image of a magical forest with glowing mushrooms."
+   - "Generate a picture of a futuristic city with flying cars."
+4. View the generated image and hear the agent's response.
+
+## Technical Architecture
+
+### FastRTC Components
+
+- **Stream**: Core component that handles WebRTC connections and audio streaming
+- **ReplyOnPause**: Detects when the user stops speaking to trigger a response
+- **get_stt_model/get_tts_model**: Provide optimized speech-to-text and text-to-speech models
+
+### smolagents Components
+
+- **CodeAgent**: Intelligent agent that can use tools based on natural-language input
+- **Tool.from_space**: Integration with Hugging Face Spaces for image generation
+- **HfApiModel**: Connection to powerful language models for understanding requests
+
+### Integration Flow
+
+1. FastRTC streams and processes audio input in real time
+2. Speech is converted to text and passed to the smolagents CodeAgent
+3. The agent processes the request and calls tools when needed
+4. Responses and generated images are streamed back through FastRTC
+5. The UI updates to show both text responses and generated images
+
+## Advanced Features
+
+- Conversation history is maintained across exchanges
+- Error handling ensures the app keeps working even if agent processing fails
+- The application leverages FastRTC's streaming capabilities for efficient audio transmission
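The README's Advanced Features section says conversation history is maintained across exchanges. A minimal, illustrative sketch of one way to keep such a history as a `List[Dict[str, str]]` and fold it into the agent prompt (the helper names `add_turn` and `build_prompt` are hypothetical, not part of this repo):

```python
from typing import Dict, List

def add_turn(history: List[Dict[str, str]], role: str, content: str) -> None:
    """Append one conversational turn to the shared history."""
    history.append({"role": role, "content": content})

def build_prompt(system_prompt: str, history: List[Dict[str, str]]) -> str:
    """Flatten the system prompt plus prior turns into a single agent input."""
    turns = "\n".join(f"{t['role']}: {t['content']}" for t in history)
    return f"{system_prompt}\n\n{turns}"

history: List[Dict[str, str]] = []
add_turn(history, "user", "I am in Paris, France. Can you find me a place to work from?")
add_turn(history, "assistant", "I found a place called 'Le Café de la Paix'.")
prompt = build_prompt("You are a helpful assistant.", history)
```

Each new exchange would append two turns and rebuild the prompt, so earlier requests stay visible to the agent.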
app.py
ADDED
@@ -0,0 +1,92 @@
+from pathlib import Path
+from typing import List, Dict
+
+from dotenv import load_dotenv
+from fastrtc import get_stt_model, get_tts_model, Stream, ReplyOnPause
+from smolagents import CodeAgent, HfApiModel, DuckDuckGoSearchTool
+
+# Load environment variables
+load_dotenv()
+
+# Initialize file paths
+curr_dir = Path(__file__).parent
+
+# Initialize models
+stt_model = get_stt_model()
+tts_model = get_tts_model()
+
+# Conversation state to maintain history
+conversation_state: List[Dict[str, str]] = []
+
+# System prompt for agent
+system_prompt = """You are a helpful assistant that can help with finding places to
+work remotely from. You should specifically check reviews and ratings of the
+place. You should use these criteria to find the best place to work from:
+- Price
+- Reviews
+- Ratings
+- Location
+- WIFI
+Only return the name, address of the place, and a short description of the place.
+Always search for real places.
+Only return real places, not fake ones.
+If you receive anything other than a location, you should ask for a location.
+<example>
+User: I am in Paris, France. Can you find me a place to work from?
+Assistant: I found a place called "Le Café de la Paix" at 123 Rue de la Paix,
+Paris, France. It has good reviews and is in a great location.
+</example>
+<example>
+User: I am in London, UK. Can you find me a place to work from?
+Assistant: I found a place called "The London Coffee Company".
+</example>
+<example>
+User: How many people are in the room?
+Assistant: I only respond to requests about finding places to work from.
+</example>
+
+"""
+
+model = HfApiModel(provider="together", model="Qwen/Qwen2.5-Coder-32B-Instruct")
+
+agent = CodeAgent(
+    tools=[
+        DuckDuckGoSearchTool(),
+    ],
+    model=model,
+    max_steps=10,
+    verbosity_level=2,
+    description="Search the web for cafes to work from.",
+)
+
+
+def process_response(audio):
+    """Process audio input and generate an LLM response with TTS."""
+    # Convert speech to text using the STT model
+    text = stt_model.stt(audio)
+    if not text.strip():
+        return
+
+    input_text = f"{system_prompt}\n\n{text}"
+    # Get a response from the agent
+    response_content = agent.run(input_text)
+
+    # Convert the response to audio using the TTS model
+    for audio_chunk in tts_model.stream_tts_sync(response_content or ""):
+        # Yield the audio chunk
+        yield audio_chunk
+
+
+stream = Stream(
+    handler=ReplyOnPause(process_response, input_sample_rate=16000),
+    modality="audio",
+    mode="send-receive",
+    ui_args={
+        "pulse_color": "rgb(255, 255, 255)",
+        "icon_button_color": "rgb(255, 255, 255)",
+        "title": "🧑‍💻 The Coworking Agent",
+    },
+)
+
+if __name__ == "__main__":
+    stream.ui.launch(server_port=7860)
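In app.py above, `process_response` is a generator: ReplyOnPause consumes the audio chunks it yields and streams them back over WebRTC. The pattern can be exercised in isolation with stub STT/TTS models (the stubs below are illustrative stand-ins, not the fastrtc API):

```python
# Stub models mimicking the shape of the handler above, so the
# generator pattern can be run without fastrtc installed.
class StubSTT:
    def stt(self, audio):
        # A real STT model would transcribe the audio buffer.
        return "I am in Paris, France."

class StubTTS:
    def stream_tts_sync(self, text):
        # A real TTS model yields audio chunks; here we yield words.
        for word in text.split():
            yield word

stt_model, tts_model = StubSTT(), StubTTS()

def process_response(audio):
    text = stt_model.stt(audio)
    if not text.strip():
        return
    response = f"Echo: {text}"  # stands in for agent.run(...)
    for chunk in tts_model.stream_tts_sync(response):
        yield chunk

chunks = list(process_response(b"fake-audio"))
```

Because the handler yields chunks instead of returning one blob, playback can begin before the full response has been synthesized.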
pyproject.toml
ADDED
@@ -0,0 +1,11 @@
+[project]
+name = "talk-to-sambanova"
+version = "0.1.0"
+description = "Add your description here"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "fastrtc[stt,tts]>=0.0.8.post1",
+    "smolagents>=1.9.2",
+]
+
requirements.txt
ADDED
@@ -0,0 +1,125 @@
+# This file was autogenerated by uv via the following command:
+#     uv export --format requirements-txt --no-hashes
+aiofiles==23.2.1
+aioice==0.9.0
+aiortc==1.10.1
+annotated-types==0.7.0
+anyio==4.8.0
+attrs==25.1.0
+audioop-lts==0.2.1 ; python_full_version >= '3.13'
+audioread==3.0.1
+av==13.1.0
+babel==2.17.0
+beautifulsoup4==4.13.3
+certifi==2025.1.31
+cffi==1.17.1
+charset-normalizer==3.4.1
+click==8.1.8
+colorama==0.4.6
+coloredlogs==15.0.1
+colorlog==6.9.0
+cryptography==44.0.1
+csvw==3.5.1
+decorator==5.2.1
+dlinfo==2.0.0
+dnspython==2.7.0
+duckduckgo-search==7.5.0
+espeakng-loader==0.2.4
+exceptiongroup==1.2.2 ; python_full_version < '3.11'
+fastapi==0.115.8
+fastrtc==0.0.8.post1
+fastrtc-moonshine-onnx==20241016
+ffmpy==0.5.0
+filelock==3.17.0
+flatbuffers==25.2.10
+fsspec==2025.2.0
+google-crc32c==1.6.0
+gradio==5.19.0
+gradio-client==1.7.2
+h11==0.14.0
+httpcore==1.0.7
+httpx==0.28.1
+huggingface-hub==0.29.1
+humanfriendly==10.0
+idna==3.10
+ifaddr==0.2.0
+isodate==0.7.2
+jinja2==3.1.5
+joblib==1.4.2
+jsonschema==4.23.0
+jsonschema-specifications==2024.10.1
+kokoro-onnx==0.4.3
+language-tags==1.2.0
+lazy-loader==0.4
+librosa==0.10.2.post1
+llvmlite==0.44.0
+lxml==5.3.1
+markdown-it-py==3.0.0
+markdownify==1.0.0
+markupsafe==2.1.5
+mdurl==0.1.2
+mpmath==1.3.0
+msgpack==1.1.0
+numba==0.61.0
+numpy==2.1.3
+onnxruntime==1.20.1
+orjson==3.10.15
+packaging==24.2
+pandas==2.2.3
+phonemizer-fork==3.3.1
+pillow==11.1.0
+platformdirs==4.3.6
+pooch==1.8.2
+primp==0.14.0
+protobuf==5.29.3
+pycparser==2.22
+pydantic==2.10.6
+pydantic-core==2.27.2
+pydub==0.25.1
+pyee==12.1.1
+pygments==2.19.1
+pylibsrtp==0.11.0
+pyopenssl==25.0.0
+pyparsing==3.2.1
+pyreadline3==3.5.4 ; sys_platform == 'win32'
+python-dateutil==2.9.0.post0
+python-dotenv==1.0.1
+python-multipart==0.0.20
+pytz==2025.1
+pyyaml==6.0.2
+rdflib==7.1.3
+referencing==0.36.2
+regex==2024.11.6
+requests==2.32.3
+rfc3986==1.5.0
+rich==13.9.4
+rpds-py==0.23.1
+ruff==0.9.7 ; sys_platform != 'emscripten'
+safehttpx==0.1.6
+scikit-learn==1.6.1
+scipy==1.15.2
+segments==2.3.0
+semantic-version==2.10.0
+shellingham==1.5.4 ; sys_platform != 'emscripten'
+six==1.17.0
+smolagents==1.9.2
+sniffio==1.3.1
+soundfile==0.13.1
+soupsieve==2.6
+soxr==0.5.0.post1
+standard-aifc==3.13.0 ; python_full_version >= '3.13'
+standard-chunk==3.13.0 ; python_full_version >= '3.13'
+standard-sunau==3.13.0 ; python_full_version >= '3.13'
+starlette==0.45.3
+sympy==1.13.3
+threadpoolctl==3.5.0
+tokenizers==0.21.0
+tomlkit==0.13.2
+tqdm==4.67.1
+typer==0.15.1 ; sys_platform != 'emscripten'
+typing-extensions==4.12.2
+tzdata==2025.1
+uritemplate==4.1.1
+urllib3==2.3.0
+uvicorn==0.34.0 ; sys_platform != 'emscripten'
+websockets==15.0