burtenshaw committed · commit 0fdf3b6 · 0 parents

initial commit

Files changed (5):

1. .gitignore +5 -0
2. README.md +108 -0
3. app.py +92 -0
4. pyproject.toml +11 -0
5. requirements.txt +125 -0
.gitignore ADDED
.env
.venv
.python-version
.vscode
uv.lock
README.md ADDED
---
title: Talk to Smolagents
emoji: 💻
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
license: mit
short_description: FastRTC Voice Agent with smolagents
tags: [webrtc, websocket, gradio, secret|HF_TOKEN]
---

# Voice LLM Agent with Image Generation

A voice-enabled AI assistant powered by FastRTC that can:

1. Stream audio in real time using WebRTC
2. Listen and respond with natural pauses in conversation
3. Generate images based on your requests
4. Maintain conversation context across exchanges

This app combines the real-time communication capabilities of FastRTC with the agent framework of smolagents.

## Key Features

- **Real-time Streaming**: Uses FastRTC's WebRTC-based audio streaming
- **Voice Activation**: Automatic detection of speech pauses to trigger responses
- **Multi-modal Interaction**: Combines voice and image generation in a single interface

## Setup

1. Install Python 3.10+ and create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Create a `.env` file with the following:

```
HF_TOKEN=your_huggingface_api_key
MODE=UI  # Use 'UI' for the Gradio interface; leave unset for the HTML interface
```

## Running the App

### With Gradio UI (Recommended)

```bash
MODE=UI python app.py
```

This launches a Gradio UI at http://localhost:7860 with:

- FastRTC's built-in streaming audio components
- A chat interface showing the conversation
- An image display panel for generated images

### With FastAPI (Advanced)

```bash
python app.py
```

This serves a FastAPI app at http://localhost:7860 that provides:

- WebRTC-based audio communication
- The same smolagents functionality in a more flexible API
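The `MODE` switch can be sketched as a small helper. This is illustrative only: `pick_interface` is a hypothetical function, not part of the app, and the real app may read the variable differently.

```python
def pick_interface(env: dict) -> str:
    """Choose the interface from a MODE flag, as described above (sketch)."""
    # MODE=UI selects the Gradio UI; anything else falls back to the FastAPI app
    return "gradio-ui" if env.get("MODE", "").upper() == "UI" else "fastapi"

print(pick_interface({"MODE": "UI"}))  # gradio-ui
print(pick_interface({}))              # fastapi
```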

## How to Use

1. Click the microphone button to start streaming your voice.
2. Speak naturally; the app automatically detects when you pause.
3. Ask the agent to generate an image, for example:
   - "Create an image of a magical forest with glowing mushrooms."
   - "Generate a picture of a futuristic city with flying cars."
4. View the generated image and hear the agent's response.

## Technical Architecture

### FastRTC Components

- **Stream**: Core component that handles WebRTC connections and audio streaming
- **ReplyOnPause**: Detects when the user stops speaking to trigger a response
- **get_stt_model/get_tts_model**: Provide optimized speech-to-text and text-to-speech models

### smolagents Components

- **CodeAgent**: Agent that can use tools based on natural-language inputs
- **Tool.from_space**: Integration with Hugging Face Spaces for image generation
- **HfApiModel**: Connection to hosted language models for understanding requests

### Integration Flow

1. FastRTC streams and processes audio input in real time
2. Speech is converted to text and passed to the smolagents CodeAgent
3. The agent processes the request and calls tools when needed
4. Responses and generated images are streamed back through FastRTC
5. The UI updates to show both text responses and generated images

## Advanced Features

- Conversation history is maintained across exchanges
- Error handling keeps the app running even if agent processing fails
- FastRTC's streaming capabilities are used for efficient audio transmission
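
The integration flow above can be sketched end to end with stand-in functions. The real app uses FastRTC's STT/TTS models and a smolagents `CodeAgent`; `stt`, `agent_run`, and `tts_chunks` below are illustrative stubs, not the actual APIs.

```python
def stt(audio: bytes) -> str:
    # Stand-in speech-to-text: pretend the audio decodes to a request
    return "Create an image of a magical forest"

def agent_run(prompt: str) -> str:
    # Stand-in for the agent: echo a canned response
    return f"Here is the result for: {prompt!r}"

def tts_chunks(text: str):
    # Stand-in text-to-speech: yield fixed-size "audio" chunks
    for i in range(0, len(text), 16):
        yield text[i:i + 16].encode()

def handle(audio: bytes):
    """Mirror of the handler shape: STT -> agent -> streamed TTS."""
    text = stt(audio)
    if not text.strip():
        return
    for chunk in tts_chunks(agent_run(text)):
        yield chunk

# prints: Here is the result for: 'Create an image of a magical forest'
print(b"".join(handle(b"\x00" * 320)).decode())
```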
app.py ADDED
from pathlib import Path
from typing import Dict, List

from dotenv import load_dotenv
from fastrtc import ReplyOnPause, Stream, get_stt_model, get_tts_model
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# Load environment variables
load_dotenv()

# Initialize file paths
curr_dir = Path(__file__).parent

# Initialize speech models
stt_model = get_stt_model()
tts_model = get_tts_model()

# Conversation state to maintain history
conversation_state: List[Dict[str, str]] = []

# System prompt for the agent
system_prompt = """You are a helpful assistant that helps with finding places to
work remotely from. You should specifically check reviews and ratings of the
place. Use these criteria to find the best place to work from:
- Price
- Reviews
- Ratings
- Location
- WIFI
Only return the name, the address of the place, and a short description of it.
Always search for real places.
Only return real places, not fake ones.
If you receive anything other than a location, ask for a location.
<example>
User: I am in Paris, France. Can you find me a place to work from?
Assistant: I found a place called "Le Café de la Paix" at 123 Rue de la Paix,
Paris, France. It has good reviews and is in a great location.
</example>
<example>
User: I am in London, UK. Can you find me a place to work from?
Assistant: I found a place called "The London Coffee Company".
</example>
<example>
User: How many people are in the room?
Assistant: I only respond to requests about finding places to work from.
</example>
"""

model = HfApiModel(provider="together", model="Qwen/Qwen2.5-Coder-32B-Instruct")

agent = CodeAgent(
    tools=[
        DuckDuckGoSearchTool(),
    ],
    model=model,
    max_steps=10,
    verbosity_level=2,
    description="Search the web for cafes to work from.",
)


def process_response(audio):
    """Process audio input and generate an LLM response with TTS."""
    # Convert speech to text using the STT model
    text = stt_model.stt(audio)
    if not text.strip():
        return

    input_text = f"{system_prompt}\n\n{text}"
    # Get a response from the agent
    response_content = agent.run(input_text)

    # Convert the response to audio using the TTS model
    for audio_chunk in tts_model.stream_tts_sync(response_content or ""):
        yield audio_chunk


stream = Stream(
    handler=ReplyOnPause(process_response, input_sample_rate=16000),
    modality="audio",
    mode="send-receive",
    ui_args={
        "pulse_color": "rgb(255, 255, 255)",
        "icon_button_color": "rgb(255, 255, 255)",
        "title": "🧑‍💻The Coworking Agent",
    },
)

if __name__ == "__main__":
    stream.ui.launch(server_port=7860)
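
The handler above calls `agent.run` directly; the error handling the README mentions could be added with a small wrapper. This is a sketch under stated assumptions: `run_agent_safely`, `broken_agent`, and the fallback message are hypothetical, not part of the app.

```python
def run_agent_safely(run, prompt: str,
                     fallback: str = "Sorry, I could not process that request.") -> str:
    """Call an agent's run callable, returning a fallback reply on failure."""
    try:
        result = run(prompt)
        # An empty/None result also falls back, so TTS always has text to speak
        return str(result) if result else fallback
    except Exception:
        # Keep the stream alive instead of crashing the handler
        return fallback


def broken_agent(prompt: str) -> str:
    # Stand-in for an agent whose backend is unavailable
    raise RuntimeError("model unavailable")


# prints the fallback message instead of raising
print(run_agent_safely(broken_agent, "find a cafe in Lisbon"))
print(run_agent_safely(lambda p: "Cafe A, 1 Main St", "find a cafe"))  # Cafe A, 1 Main St
```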
pyproject.toml ADDED
[project]
name = "talk-to-sambanova"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "fastrtc[stt,tts]>=0.0.8.post1",
    "smolagents>=1.9.2",
]
requirements.txt ADDED
# This file was autogenerated by uv via the following command:
#     uv export --format requirements-txt --no-hashes
aiofiles==23.2.1
aioice==0.9.0
aiortc==1.10.1
annotated-types==0.7.0
anyio==4.8.0
attrs==25.1.0
audioop-lts==0.2.1 ; python_full_version >= '3.13'
audioread==3.0.1
av==13.1.0
babel==2.17.0
beautifulsoup4==4.13.3
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.9.0
cryptography==44.0.1
csvw==3.5.1
decorator==5.2.1
dlinfo==2.0.0
dnspython==2.7.0
duckduckgo-search==7.5.0
espeakng-loader==0.2.4
exceptiongroup==1.2.2 ; python_full_version < '3.11'
fastapi==0.115.8
fastrtc==0.0.8.post1
fastrtc-moonshine-onnx==20241016
ffmpy==0.5.0
filelock==3.17.0
flatbuffers==25.2.10
fsspec==2025.2.0
google-crc32c==1.6.0
gradio==5.19.0
gradio-client==1.7.2
h11==0.14.0
httpcore==1.0.7
httpx==0.28.1
huggingface-hub==0.29.1
humanfriendly==10.0
idna==3.10
ifaddr==0.2.0
isodate==0.7.2
jinja2==3.1.5
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kokoro-onnx==0.4.3
language-tags==1.2.0
lazy-loader==0.4
librosa==0.10.2.post1
llvmlite==0.44.0
lxml==5.3.1
markdown-it-py==3.0.0
markdownify==1.0.0
markupsafe==2.1.5
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.1.0
numba==0.61.0
numpy==2.1.3
onnxruntime==1.20.1
orjson==3.10.15
packaging==24.2
pandas==2.2.3
phonemizer-fork==3.3.1
pillow==11.1.0
platformdirs==4.3.6
pooch==1.8.2
primp==0.14.0
protobuf==5.29.3
pycparser==2.22
pydantic==2.10.6
pydantic-core==2.27.2
pydub==0.25.1
pyee==12.1.1
pygments==2.19.1
pylibsrtp==0.11.0
pyopenssl==25.0.0
pyparsing==3.2.1
pyreadline3==3.5.4 ; sys_platform == 'win32'
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
pyyaml==6.0.2
rdflib==7.1.3
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rfc3986==1.5.0
rich==13.9.4
rpds-py==0.23.1
ruff==0.9.7 ; sys_platform != 'emscripten'
safehttpx==0.1.6
scikit-learn==1.6.1
scipy==1.15.2
segments==2.3.0
semantic-version==2.10.0
shellingham==1.5.4 ; sys_platform != 'emscripten'
six==1.17.0
smolagents==1.9.2
sniffio==1.3.1
soundfile==0.13.1
soupsieve==2.6
soxr==0.5.0.post1
standard-aifc==3.13.0 ; python_full_version >= '3.13'
standard-chunk==3.13.0 ; python_full_version >= '3.13'
standard-sunau==3.13.0 ; python_full_version >= '3.13'
starlette==0.45.3
sympy==1.13.3
threadpoolctl==3.5.0
tokenizers==0.21.0
tomlkit==0.13.2
tqdm==4.67.1
typer==0.15.1 ; sys_platform != 'emscripten'
typing-extensions==4.12.2
tzdata==2025.1
uritemplate==4.1.1
urllib3==2.3.0
uvicorn==0.34.0 ; sys_platform != 'emscripten'
websockets==15.0