Aliaksandr karapuz and Maksim Liutisch committed on
Commit
f655f69
·
unverified ·
1 Parent(s): e59e248

merge dev into main (#13)


* add description to readme

* sound effects v2 (#15)

* use TTS with timestamps
* ask LLM to place sound effects at arbitrary places in the text
* lots of refactoring

* Improve sound effect prompt (#21)

Improve sound effects generation (a sketch of this post-processing follows the commit message):
* upd prompt
* add fade-in and fade-out
* make effects quieter
* make effects at least 1 second long
* bugfix in effects regex

* Feature/emotions checking (#23)

- update text preprocessing for TTS
- add left and right text contexts to TTS model
- upd TTS params selection
- upd effects prompt
- upd OPENAI_MAX_PARALLEL

* Visualization (#22)

update visualizations

---------

Co-authored-by: Skidan Olya <[email protected]>
Co-authored-by: Maksim Liutisch <[email protected]>
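
The #21 bullets above describe post-processing applied to each generated effect clip: fade the edges, lower the volume, and pad very short clips so they last at least one second. Below is a minimal sketch of that kind of processing using pydub; the library choice, the helper name, and the millisecond/dB values are illustrative assumptions, not the repository's actual implementation.

```python
from pydub import AudioSegment


def postprocess_effect(in_path: str, out_path: str) -> None:
    """Illustrative effect post-processing: quieter, faded, at least 1 s long."""
    clip = AudioSegment.from_file(in_path)

    # Pad with trailing silence so the effect lasts at least 1 second
    # (pydub durations are in milliseconds).
    if len(clip) < 1000:
        clip = clip + AudioSegment.silent(duration=1000 - len(clip))

    # Make the effect noticeably quieter than the speech track.
    clip = clip - 12  # reduce gain by 12 dB (value chosen for illustration)

    # Fade in and out so the effect does not pop in abruptly.
    clip = clip.fade_in(300).fade_out(300)

    clip.export(out_path, format="wav")
```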

.gitignore CHANGED
@@ -4,7 +4,6 @@ venv
4
  .python-version
5
  .DS_Store
6
 
7
- data/books
8
- data/audiobooks
9
 
10
  .env
 
4
  .python-version
5
  .DS_Store
6
 
7
+ data/**/
 
8
 
9
  .env
README.md CHANGED
@@ -10,22 +10,8 @@ pinned: false
10
  python_version: 3.11
11
  ---
12
 
13
- ### Action Items / Ideas
14
 
15
- - intonations
16
- - add context
17
- - audio effects
18
- - add context
19
- - filter, apply only for long phrases
20
- - improve UI
21
- - show character parts
22
- - testing
23
- - eval current execution time
24
- - optimizations
25
- - combine sequential phrases of same character in single phrase
26
- - support large texts. use batching. problem: how to ensure same characters?
27
- - can detect characters in first prompt, then split text in each batch into character phrases
28
- - probably split large phrases into smaller ones
29
- - identify unknown characters
30
- - use LLM to recognize characters for a given text and provide descriptions detailed enough to select appropriate voice
31
 
 
 
10
  python_version: 3.11
11
  ---
12
 
13
+ ## Description
14
 
15
+ Automatically generate audiobooks from text input. Automatically detect characters and map them to appropriate voices. Use text-to-speech models combined with text-to-audio-effect models to create an immersive listening experience.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
16
 
17
+ This project focuses on the automatic generation of audiobooks from text input, offering an immersive experience. The system intelligently detects characters in the text and assigns them distinct, appropriate voices using a Large Language Model. To enhance the auditory experience, the project incorporates text-to-audio-effect models, adding relevant background sounds and audio effects that match the context of the narrative. The combination of natural-sounding speech synthesis and environmental sound design creates a rich, engaging audiobook experience that adapts seamlessly to different genres and styles of writing, making the storytelling more vivid and captivating for listeners.
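
The description above maps onto the `AudiobookBuilder` API reworked in `app.py` below: `run()` is an async generator that yields intermediate stages so the UI can display progress before the final audio is ready. A minimal usage sketch outside of Gradio follows; the exact layout of each yielded stage is an assumption inferred from the Gradio outputs (`audio_output`, `error_output`, `status_display`).

```python
import asyncio

from src.builder import AudiobookBuilder


async def build(text: str) -> None:
    builder = AudiobookBuilder()
    audio_path = None
    # Positional arguments follow the call in app.py:
    # run(text, generate_effects, use_user_voice, voice_id).
    async for stage in builder.run(text, False, False, None):
        # Assumed stage layout: (audio file path or None, error message, status HTML).
        audio_path, error_message, status_html = stage
        print(status_html)
    print("Generated audiobook:", audio_path)


asyncio.run(build("Paste the book text here."))
```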
app.py CHANGED
@@ -1,6 +1,5 @@
1
  import os
2
  from pathlib import Path
3
- from typing import List
4
 
5
  import gradio as gr
6
  from dotenv import load_dotenv
@@ -8,9 +7,11 @@ from langchain_community.document_loaders import PyPDFLoader
8
 
9
  load_dotenv()
10
 
11
- from src.builder import AudiobookBuilder
12
- from src.config import logger, FILE_SIZE_MAX, MAX_TEXT_LEN, DESCRIPTION
13
  from data import samples_to_split as samples
 
 
 
 
14
 
15
 
16
  def get_auth_params():
@@ -27,13 +28,10 @@ def parse_pdf(file_path):
27
 
28
 
29
  def load_text_from_file(uploaded_file):
30
- # Save the uploaded file temporarily to check its size
31
  temp_file_path = uploaded_file.name
32
 
33
  if os.path.getsize(temp_file_path) > FILE_SIZE_MAX * 1024 * 1024:
34
- raise ValueError(
35
- f"The uploaded file exceeds the size limit of {FILE_SIZE_MAX} MB."
36
- )
37
 
38
  if uploaded_file.name.endswith(".txt"):
39
  with open(temp_file_path, "r", encoding="utf-8") as file:
@@ -46,45 +44,58 @@ def load_text_from_file(uploaded_file):
46
  return text
47
 
48
 
49
- async def respond(
50
  text: str,
51
  uploaded_file,
52
  generate_effects: bool,
53
- ) -> tuple[Path | None, str]:
 
 
 
 
54
  if uploaded_file is not None:
55
  try:
56
  text = load_text_from_file(uploaded_file=uploaded_file)
57
  except Exception as e:
58
  logger.exception(e)
59
- return (None, str(e))
 
 
 
 
 
 
 
 
 
 
60
 
61
  if (text_len := len(text)) > MAX_TEXT_LEN:
62
- gr.Warning(
63
  f"Input text length of {text_len} characters "
64
  f"exceeded current limit of {MAX_TEXT_LEN} characters. "
65
  "Please input a shorter text."
66
  )
67
- return None, ""
68
-
69
- builder = AudiobookBuilder()
70
- audio_fp = await builder.run(text=text, generate_effects=generate_effects)
71
 
72
- return audio_fp, ""
 
73
 
74
 
75
  def refresh():
76
- return None, None, None # Reset audio output, error message, and uploaded file
77
 
78
 
79
- with gr.Blocks(title="Audiobooks Generation") as ui:
80
- gr.Markdown(DESCRIPTION)
81
-
82
  with gr.Row(variant="panel"):
83
  text_input = gr.Textbox(label="Enter the book text here", lines=15)
84
  file_input = gr.File(
85
  label="Upload a text file or PDF",
86
  file_types=[".txt", ".pdf"],
87
- visible=False,
88
  )
89
 
90
  examples = gr.Examples(
@@ -104,33 +115,49 @@ with gr.Blocks(title="Audiobooks Generation") as ui:
104
  ],
105
  )
106
 
107
- audio_output = gr.Audio(
108
- label='Generated audio. Please wait for the waveform to appear, before hitting "Play"',
109
- type="filepath",
110
- )
111
- # error output is hidden initially
112
  error_output = gr.Textbox(label="Error Message", interactive=False, visible=False)
113
 
114
  effects_generation_checkbox = gr.Checkbox(
115
- label="Add background effects",
116
  value=False,
117
  info="Select if you want to add occasional sound effect to the audiobook",
118
  )
119
 
 
 
 
 
 
 
 
 
120
  with gr.Row(variant="panel"):
121
- submit_button = gr.Button("Generate the audiobook", variant="primary")
122
  refresh_button = gr.Button("Refresh", variant="secondary")
123
 
 
 
 
 
 
 
 
 
 
 
124
  submit_button.click(
125
- fn=respond,
126
  inputs=[
127
  text_input,
128
  file_input,
129
  effects_generation_checkbox,
 
 
130
  ], # Include the uploaded file as an input
131
  outputs=[
132
  audio_output,
133
  error_output,
 
134
  ], # Include the audio output and error message output
135
  )
136
  refresh_button.click(
@@ -142,21 +169,16 @@ with gr.Blocks(title="Audiobooks Generation") as ui:
142
  file_input,
143
  ], # Reset audio output, error message, and uploaded file
144
  )
145
-
146
- # Hide error message dynamically when input is received
147
  text_input.change(
148
  fn=lambda _: gr.update(visible=False), # Hide the error field
149
  inputs=[text_input],
150
  outputs=error_output,
151
  )
152
-
153
  file_input.change(
154
  fn=lambda _: gr.update(visible=False), # Hide the error field
155
  inputs=[file_input],
156
  outputs=error_output,
157
  )
158
-
159
- # To clear error field when refreshing
160
  refresh_button.click(
161
  fn=lambda _: gr.update(visible=False), # Hide the error field
162
  inputs=[],
 
1
  import os
2
  from pathlib import Path
 
3
 
4
  import gradio as gr
5
  from dotenv import load_dotenv
 
7
 
8
  load_dotenv()
9
 
 
 
10
  from data import samples_to_split as samples
11
+ from src.builder import AudiobookBuilder
12
+ from src.config import FILE_SIZE_MAX, MAX_TEXT_LEN, logger
13
+ from src.web.utils import create_status_html
14
+ from src.web.variables import DESCRIPTION_JS, GRADIO_THEME, STATUS_DISPLAY_HTML, VOICE_UPLOAD_JS
15
 
16
 
17
  def get_auth_params():
 
28
 
29
 
30
  def load_text_from_file(uploaded_file):
 
31
  temp_file_path = uploaded_file.name
32
 
33
  if os.path.getsize(temp_file_path) > FILE_SIZE_MAX * 1024 * 1024:
34
+ raise ValueError(f"The uploaded file exceeds the size limit of {FILE_SIZE_MAX} MB.")
 
 
35
 
36
  if uploaded_file.name.endswith(".txt"):
37
  with open(temp_file_path, "r", encoding="utf-8") as file:
 
44
  return text
45
 
46
 
47
+ async def audiobook_builder(
48
  text: str,
49
  uploaded_file,
50
  generate_effects: bool,
51
+ use_user_voice: bool,
52
+ voice_id: str | None = None,
53
+ ):
54
+ builder = AudiobookBuilder()
55
+
56
  if uploaded_file is not None:
57
  try:
58
  text = load_text_from_file(uploaded_file=uploaded_file)
59
  except Exception as e:
60
  logger.exception(e)
61
+ msg = "Failed to load text from the provided document"
62
+ gr.Warning(msg)
63
+ yield None, str(e), builder.html_generator.generate_error(msg)
64
+ return
65
+
66
+ if not text:
67
+ logger.info("No text was passed; can't generate an audiobook")
68
+ msg = "Please provide the text to generate the audiobook from"
69
+ gr.Warning(msg)
70
+ yield None, "", builder.html_generator.generate_error(msg)
71
+ return
72
 
73
  if (text_len := len(text)) > MAX_TEXT_LEN:
74
+ msg = (
75
  f"Input text length of {text_len} characters "
76
  f"exceeded current limit of {MAX_TEXT_LEN} characters. "
77
  "Please input a shorter text."
78
  )
79
+ logger.info(msg)
80
+ gr.Warning(msg)
81
+ yield None, "", builder.html_generator.generate_error(msg)
82
+ return
83
 
84
+ async for stage in builder.run(text, generate_effects, use_user_voice, voice_id):
85
+ yield stage
86
 
87
 
88
  def refresh():
89
+ return None, None, None, STATUS_DISPLAY_HTML
90
 
91
 
92
+ with gr.Blocks(js=DESCRIPTION_JS, theme=GRADIO_THEME) as ui:
 
 
93
  with gr.Row(variant="panel"):
94
  text_input = gr.Textbox(label="Enter the book text here", lines=15)
95
  file_input = gr.File(
96
  label="Upload a text file or PDF",
97
  file_types=[".txt", ".pdf"],
98
+ visible=True,
99
  )
100
 
101
  examples = gr.Examples(
 
115
  ],
116
  )
117
 
 
 
 
 
 
118
  error_output = gr.Textbox(label="Error Message", interactive=False, visible=False)
119
 
120
  effects_generation_checkbox = gr.Checkbox(
121
+ label="Add sound effects",
122
  value=False,
123
  info="Select if you want to add occasional sound effect to the audiobook",
124
  )
125
 
126
+ use_voice_checkbox = gr.Checkbox(
127
+ label="Use my voice",
128
+ value=False,
129
+ info="Select if you want to use your voice for whole or part of the audiobook (Generations may take longer than usual)",
130
+ )
131
+
132
+ submit_button = gr.Button("Generate the audiobook", variant="primary")
133
+
134
  with gr.Row(variant="panel"):
135
+ add_voice_btn = gr.Button("Add my voice", variant="primary")
136
  refresh_button = gr.Button("Refresh", variant="secondary")
137
 
138
+ voice_result = gr.Textbox(visible=False, interactive=False, label="Processed Result")
139
+ status_display = gr.HTML(value=STATUS_DISPLAY_HTML, label="Generation Status")
140
+ audio_output = gr.Audio(
141
+ label='Generated audio. Please wait for the waveform to appear, before hitting "Play"',
142
+ type="filepath",
143
+ )
144
+
145
+ # callbacks
146
+
147
+ add_voice_btn.click(fn=None, inputs=None, outputs=voice_result, js=VOICE_UPLOAD_JS)
148
  submit_button.click(
149
+ fn=audiobook_builder,
150
  inputs=[
151
  text_input,
152
  file_input,
153
  effects_generation_checkbox,
154
+ use_voice_checkbox,
155
+ voice_result,
156
  ], # Include the uploaded file as an input
157
  outputs=[
158
  audio_output,
159
  error_output,
160
+ status_display,
161
  ], # Include the audio output and error message output
162
  )
163
  refresh_button.click(
 
169
  file_input,
170
  ], # Reset audio output, error message, and uploaded file
171
  )
 
 
172
  text_input.change(
173
  fn=lambda _: gr.update(visible=False), # Hide the error field
174
  inputs=[text_input],
175
  outputs=error_output,
176
  )
 
177
  file_input.change(
178
  fn=lambda _: gr.update(visible=False), # Hide the error field
179
  inputs=[file_input],
180
  outputs=error_output,
181
  )
 
 
182
  refresh_button.click(
183
  fn=lambda _: gr.update(visible=False), # Hide the error field
184
  inputs=[],
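
The wiring above binds the `audiobook_builder` generator to the submit button, so every `yield` updates the audio, error, and status components in place. The standalone sketch below shows the same Gradio pattern with illustrative component names and a fake builder; the real app streams stages from `AudiobookBuilder.run()` instead.

```python
import asyncio

import gradio as gr


async def fake_builder(text: str):
    # Each yield pushes fresh values into the three bound output components.
    yield None, "", "<p>Splitting text into character phrases…</p>"
    await asyncio.sleep(1)
    yield None, "", "<p>Generating speech and sound effects…</p>"
    await asyncio.sleep(1)
    # "final.wav" is a placeholder path; point it at a real file to hear audio.
    yield "final.wav", "", "<p>Done</p>"


with gr.Blocks() as demo:
    text_in = gr.Textbox(label="Book text")
    audio_out = gr.Audio(type="filepath", label="Generated audio")
    error_out = gr.Textbox(visible=False)
    status_out = gr.HTML()
    gr.Button("Generate").click(
        fn=fake_builder,
        inputs=[text_in],
        outputs=[audio_out, error_out, status_out],
    )

demo.launch()
```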
data/11labs_available_tts_voices.reviewed.csv CHANGED
@@ -19,7 +19,7 @@ teAOBFSeynXfbyNgq6Ec,Ally - Curious and Chill,https://storage.googleapis.com/ele
19
  IKne3meq5aSn9XLyUdCD,Charlie,https://storage.googleapis.com/eleven-public-prod/premade/voices/IKne3meq5aSn9XLyUdCD/102de6f2-22ed-43e0-a1f1-111fa75c5481.mp3,ok,,,FALSE,FALSE,australian,natural,middle_aged,male,conversational,,
20
  cjVigY5qzO86Huf0OWal,Eric,https://storage.googleapis.com/eleven-public-prod/premade/voices/cjVigY5qzO86Huf0OWal/d098fda0-6456-4030-b3d8-63aa048c9070.mp3,medium,,,FALSE,FALSE,american,friendly,middle_aged,male,conversational,,
21
  BFUk567oZITYKwOqegEq,Riley - loud and intense,https://storage.googleapis.com/eleven-public-prod/UwDtqCF44YaL77wxb8DVQlHT5Gp1/voices/60G0VdAP3WBQQbE6tSkT/ecc00def-2543-4b50-b93d-5d4b6c7dca33.mp3,very bad,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,,intense
22
- EkuRA6XL9UbflTWEtNbQ,Middle age Southern Male,https://storage.googleapis.com/eleven-public-prod/0gh9bWjaVmNOvQJVcRddxeYIS2z1/voices/t5Oo3tZSuEZt6BD2VGV4/5c0177c5-46bd-414c-abfd-6cd6d5677f08.mp3,medium,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,,casual
23
  MP7UPhn7eVWqCGJGIh6Q,Aaron Patrick - Fun-Upbeat,https://storage.googleapis.com/eleven-public-prod/database/user/ktIm5hvnGlc2TVlwOiZmbmw9kHy2/voices/MP7UPhn7eVWqCGJGIh6Q/NFiMZncqQJ0IFTzFGbwQ.mp3,ok,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,en,upbeat
24
  RPEIZnKMqlQiZyZd1Dae,Christopher - friendly guy next door,https://storage.googleapis.com/eleven-public-prod/database/user/HURZYaLa4shZEqiT75qd5tyEsSr1/voices/RPEIZnKMqlQiZyZd1Dae/FwLtZ4mCBHV0eLjbUM8Y.mp3,ok,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,en,casual
25
  Tx7VLgfksXHVnoY6jDGU,"Conversational Joe - A chatty casual voice, British RP male",https://storage.googleapis.com/eleven-public-prod/database/user/wf6Rmje05ZbqeHYfK82ThsPKouC2/voices/Tx7VLgfksXHVnoY6jDGU/ab4X4F9RcNSeTwBS8KS9.mp3,ok,,admin,FALSE,FALSE,british,,middle_aged,male,conversational,en,casual
 
19
  IKne3meq5aSn9XLyUdCD,Charlie,https://storage.googleapis.com/eleven-public-prod/premade/voices/IKne3meq5aSn9XLyUdCD/102de6f2-22ed-43e0-a1f1-111fa75c5481.mp3,ok,,,FALSE,FALSE,australian,natural,middle_aged,male,conversational,,
20
  cjVigY5qzO86Huf0OWal,Eric,https://storage.googleapis.com/eleven-public-prod/premade/voices/cjVigY5qzO86Huf0OWal/d098fda0-6456-4030-b3d8-63aa048c9070.mp3,medium,,,FALSE,FALSE,american,friendly,middle_aged,male,conversational,,
21
  BFUk567oZITYKwOqegEq,Riley - loud and intense,https://storage.googleapis.com/eleven-public-prod/UwDtqCF44YaL77wxb8DVQlHT5Gp1/voices/60G0VdAP3WBQQbE6tSkT/ecc00def-2543-4b50-b93d-5d4b6c7dca33.mp3,very bad,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,,intense
22
+ EkuRA6XL9UbflTWEtNbQ,Middle age Southern Male,https://storage.googleapis.com/eleven-public-prod/0gh9bWjaVmNOvQJVcRddxeYIS2z1/voices/t5Oo3tZSuEZt6BD2VGV4/5c0177c5-46bd-414c-abfd-6cd6d5677f08.mp3,bad,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,,casual
23
  MP7UPhn7eVWqCGJGIh6Q,Aaron Patrick - Fun-Upbeat,https://storage.googleapis.com/eleven-public-prod/database/user/ktIm5hvnGlc2TVlwOiZmbmw9kHy2/voices/MP7UPhn7eVWqCGJGIh6Q/NFiMZncqQJ0IFTzFGbwQ.mp3,ok,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,en,upbeat
24
  RPEIZnKMqlQiZyZd1Dae,Christopher - friendly guy next door,https://storage.googleapis.com/eleven-public-prod/database/user/HURZYaLa4shZEqiT75qd5tyEsSr1/voices/RPEIZnKMqlQiZyZd1Dae/FwLtZ4mCBHV0eLjbUM8Y.mp3,ok,,admin,FALSE,FALSE,american,,middle_aged,male,conversational,en,casual
25
  Tx7VLgfksXHVnoY6jDGU,"Conversational Joe - A chatty casual voice, British RP male",https://storage.googleapis.com/eleven-public-prod/database/user/wf6Rmje05ZbqeHYfK82ThsPKouC2/voices/Tx7VLgfksXHVnoY6jDGU/ab4X4F9RcNSeTwBS8KS9.mp3,ok,,admin,FALSE,FALSE,british,,middle_aged,male,conversational,en,casual
reviewed_voices.xlsx β†’ data/reviewed_voices.xlsx RENAMED
File without changes
voices_to_consider.xlsx β†’ data/voices_to_consider.xlsx RENAMED
File without changes
makefile ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ # install python dependencies in current environment
2
+ install:
3
+ pip install -r requirements.txt
4
+
5
+ # format python files
6
+ format:
7
+ black .
8
+ isort .
pg.ipynb β†’ notebooks/eda_voices.ipynb RENAMED
@@ -1,5 +1,21 @@
1
  {
2
  "cells": [
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3
  {
4
  "cell_type": "code",
5
  "execution_count": 1,
@@ -12,29 +28,12 @@
12
  },
13
  {
14
  "cell_type": "code",
15
- "execution_count": 2,
16
  "metadata": {},
17
  "outputs": [],
18
  "source": [
19
- "import os\n",
20
- "\n",
21
  "import dotenv\n",
22
- "import pandas as pd\n",
23
- "from httpx import Timeout\n",
24
- "from pydantic import BaseModel\n",
25
- "from langchain_core.prompts import (\n",
26
- " ChatPromptTemplate,\n",
27
- " SystemMessagePromptTemplate,\n",
28
- " HumanMessagePromptTemplate,\n",
29
- ")\n",
30
- "from langchain_openai import ChatOpenAI\n",
31
- "from langchain_community.callbacks import get_openai_callback\n",
32
- "\n",
33
- "import data.samples_to_split as samples\n",
34
- "\n",
35
- "from src.lc_callbacks import LCMessageLoggerAsync\n",
36
- "from src.utils import GPTModels\n",
37
- "from src.text_split_chain import create_split_text_chain"
38
  ]
39
  },
40
  {
@@ -830,268 +829,6 @@
830
  "outputs": [],
831
  "source": []
832
  },
833
- {
834
- "cell_type": "markdown",
835
- "metadata": {},
836
- "source": [
837
- "## split text into character phrases"
838
- ]
839
- },
840
- {
841
- "cell_type": "code",
842
- "execution_count": 4,
843
- "metadata": {},
844
- "outputs": [
845
- {
846
- "name": "stderr",
847
- "output_type": "stream",
848
- "text": [
849
- "2024-10-10 02:34:52,755 [INFO] audio-books (lc_callbacks.py): call to <failed to determine LLM> with 2 messages:\n",
850
- "{'role': 'system', 'content': 'you are provided with the book sample.\\nplease rewrite it and insert xml tags indicating character to whom current phrase belongs.\\nfor example: <narrator>I looked at her</narrator><Jill>What are you looking at?</Jill>\\n\\nNotes:\\n- sometimes narrator is one of characters taking part in the action.\\nin this case use narrator\\'s name (if available) instead of \"narrator\"\\n- if it\\'s impossible to identify character name from the text provided, use codes \"c1\", \"c2\", etc,\\nwhere \"c\" prefix means character and number is used to enumerate unknown characters\\n- all quotes of direct speech must be attributed to characters, for example:\\n<Tom>β€œShe’s a nice girl,”</Tom><narrator>said Tom after a moment.</narrator>\\nmind that sometimes narrator could also be a character.\\n- use ALL available context to determine the character.\\nsometimes the character name becomes clear from the following phrases\\n- DO NOT include in your response anything except for the original text with character xml tags!!!\\n'}\n",
851
- "{'role': 'human', 'content': 'Here is the book sample:\\n---\\nInside, the crimson room bloomed with light. Tom and Miss Baker sat at\\neither end of the long couch and she read aloud to him from the\\nSaturday Evening Postβ€”the words, murmurous and uninflected, running\\ntogether in a soothing tune. The lamplight, bright on his boots and\\ndull on the autumn-leaf yellow of her hair, glinted along the paper as\\nshe turned a page with a flutter of slender muscles in her arms.\\n\\nWhen we came in she held us silent for a moment with a lifted hand.\\n\\nβ€œTo be continued,” she said, tossing the magazine on the table, β€œin\\nour very next issue.”\\n\\nHer body asserted itself with a restless movement of her knee, and she\\nstood up.\\n\\nβ€œTen o’clock,” she remarked, apparently finding the time on the\\nceiling. β€œTime for this good girl to go to bed.”\\n\\nβ€œJordan’s going to play in the tournament tomorrow,” explained Daisy,\\nβ€œover at Westchester.”\\n\\nβ€œOhβ€”you’re Jordan Baker.”\\n\\nI knew now why her face was familiarβ€”its pleasing contemptuous\\nexpression had looked out at me from many rotogravure pictures of the\\nsporting life at Asheville and Hot Springs and Palm Beach. I had heard\\nsome story of her too, a critical, unpleasant story, but what it was I\\nhad forgotten long ago.\\n\\nβ€œGood night,” she said softly. β€œWake me at eight, won’t you.”\\n\\nβ€œIf you’ll get up.”\\n\\nβ€œI will. Good night, Mr. Carraway. See you anon.”\\n\\nβ€œOf course you will,” confirmed Daisy. β€œIn fact I think I’ll arrange a\\nmarriage. Come over often, Nick, and I’ll sort ofβ€”ohβ€”fling you\\ntogether. You knowβ€”lock you up accidentally in linen closets and push\\nyou out to sea in a boat, and all that sort of thing—”\\n\\nβ€œGood night,” called Miss Baker from the stairs. β€œI haven’t heard a\\nword.”\\n\\nβ€œShe’s a nice girl,” said Tom after a moment. β€œThey oughtn’t to let\\nher run around the country this way.”\\n\\nβ€œWho oughtn’t to?” inquired Daisy coldly.\\n\\nβ€œHer family.”\\n\\nβ€œHer family is one aunt about a thousand years old. Besides, Nick’s\\ngoing to look after her, aren’t you, Nick? She’s going to spend lots\\nof weekends out here this summer. I think the home influence will be\\nvery good for her.”\\n\\nDaisy and Tom looked at each other for a moment in silence.\\n\\nβ€œIs she from New York?” I asked quickly.\\n\\nβ€œFrom Louisville. Our white girlhood was passed together there. Our\\nbeautiful white—”\\n\\nβ€œDid you give Nick a little heart to heart talk on the veranda?”\\ndemanded Tom suddenly.\\n\\nβ€œDid I?” She looked at me. β€œI can’t seem to remember, but I think we\\ntalked about the Nordic race. Yes, I’m sure we did. It sort of crept\\nup on us and first thing you know—”\\n\\nβ€œDon’t believe everything you hear, Nick,” he advised me.\\n'}\n",
852
- "2024-10-10 02:35:04,369 [INFO] httpx (_client.py): HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
853
- "2024-10-10 02:35:04,383 [INFO] audio-books (lc_callbacks.py): raw LLM response: \"<narrator>Inside, the crimson room bloomed with light. Tom and Miss Baker sat at either end of the long couch and she read aloud to him from the Saturday Evening Postβ€”the words, murmurous and uninflected, running together in a soothing tune. The lamplight, bright on his boots and dull on the autumn-leaf yellow of her hair, glinted along the paper as she turned a page with a flutter of slender muscles in her arms.</narrator>\n",
854
- "\n",
855
- "<narrator>When we came in she held us silent for a moment with a lifted hand.</narrator>\n",
856
- "\n",
857
- "<Jordan>β€œTo be continued,”</Jordan> <narrator>she said, tossing the magazine on the table,</narrator> <Jordan>β€œin our very next issue.”</Jordan>\n",
858
- "\n",
859
- "<narrator>Her body asserted itself with a restless movement of her knee, and she stood up.</narrator>\n",
860
- "\n",
861
- "<Jordan>β€œTen o’clock,”</Jordan> <narrator>she remarked, apparently finding the time on the ceiling.</narrator> <Jordan>β€œTime for this good girl to go to bed.”</Jordan>\n",
862
- "\n",
863
- "<Daisy>β€œJordan’s going to play in the tournament tomorrow,”</Daisy> <narrator>explained Daisy,</narrator> <Daisy>β€œover at Westchester.”</Daisy>\n",
864
- "\n",
865
- "<narrator>β€œOhβ€”you’re Jordan Baker.”</narrator>\n",
866
- "\n",
867
- "<narrator>I knew now why her face was familiarβ€”its pleasing contemptuous expression had looked out at me from many rotogravure pictures of the sporting life at Asheville and Hot Springs and Palm Beach. I had heard some story of her too, a critical, unpleasant story, but what it was I had forgotten long ago.</narrator>\n",
868
- "\n",
869
- "<Jordan>β€œGood night,”</Jordan> <narrator>she said softly.</narrator> <Jordan>β€œWake me at eight, won’t you.”</Jordan>\n",
870
- "\n",
871
- "<Daisy>β€œIf you’ll get up.”</Daisy>\n",
872
- "\n",
873
- "<Jordan>β€œI will. Good night, Mr. Carraway. See you anon.”</Jordan>\n",
874
- "\n",
875
- "<Daisy>β€œOf course you will,”</Daisy> <narrator>confirmed Daisy.</narrator> <Daisy>β€œIn fact I think I’ll arrange a marriage. Come over often, Nick, and I’ll sort ofβ€”ohβ€”fling you together. You knowβ€”lock you up accidentally in linen closets and push you out to sea in a boat, and all that sort of thing—”</Daisy>\n",
876
- "\n",
877
- "<Jordan>β€œGood night,”</Jordan> <narrator>called Miss Baker from the stairs.</narrator> <Jordan>β€œI haven’t heard a word.”</Jordan>\n",
878
- "\n",
879
- "<Tom>β€œShe’s a nice girl,”</Tom> <narrator>said Tom after a moment.</narrator> <Tom>β€œThey oughtn’t to let her run around the country this way.”</Tom>\n",
880
- "\n",
881
- "<Daisy>β€œWho oughtn’t to?”</Daisy> <narrator>inquired Daisy coldly.</narrator>\n",
882
- "\n",
883
- "<Tom>β€œHer family.”</Tom>\n",
884
- "\n",
885
- "<Daisy>β€œHer family is one aunt about a thousand years old. Besides, Nick’s going to look after her, aren’t you, Nick? She’s going to spend lots of weekends out here this summer. I think the home influence will be very good for her.”</Daisy>\n",
886
- "\n",
887
- "<narrator>Daisy and Tom looked at each other for a moment in silence.</narrator>\n",
888
- "\n",
889
- "<narrator>β€œIs she from New York?”</narrator> <narrator>I asked quickly.</narrator>\n",
890
- "\n",
891
- "<Daisy>β€œFrom Louisville. Our white girlhood was passed together there. Our beautiful white—”</Daisy>\n",
892
- "\n",
893
- "<Tom>β€œDid you give Nick a little heart to heart talk on the veranda?”</Tom> <narrator>demanded Tom suddenly.</narrator>\n",
894
- "\n",
895
- "<Daisy>β€œDid I?”</Daisy> <narrator>She looked at me.</narrator> <Daisy>β€œI can’t seem to remember, but I think we talked about the Nordic race. Yes, I’m sure we did. It sort of crept up on us and first thing you know—”</Daisy>\n",
896
- "\n",
897
- "<Tom>β€œDon’t believe everything you hear, Nick,”</Tom> <narrator>he advised me.</narrator>\"\n"
898
- ]
899
- }
900
- ],
901
- "source": [
902
- "chain = create_split_text_chain(llm_model=GPTModels.GPT_4o)\n",
903
- "# chain = create_split_text_chain(llm_model=GPTModels.GPT_4_TURBO_2024_04_09)\n",
904
- "with get_openai_callback() as cb:\n",
905
- " res = chain.invoke(\n",
906
- " {\"text\": samples.GATSBY_2}, config={\"callbacks\": [LCMessageLoggerAsync()]}\n",
907
- " )"
908
- ]
909
- },
910
- {
911
- "cell_type": "code",
912
- "execution_count": 5,
913
- "metadata": {},
914
- "outputs": [
915
- {
916
- "data": {
917
- "text/plain": [
918
- "SplitTextOutput(text_raw='Inside, the crimson room bloomed with light. Tom and Miss Baker sat at\\neither end of the long couch and she read aloud to him from the\\nSaturday Evening Postβ€”the words, murmurous and uninflected, running\\ntogether in a soothing tune. The lamplight, bright on his boots and\\ndull on the autumn-leaf yellow of her hair, glinted along the paper as\\nshe turned a page with a flutter of slender muscles in her arms.\\n\\nWhen we came in she held us silent for a moment with a lifted hand.\\n\\nβ€œTo be continued,” she said, tossing the magazine on the table, β€œin\\nour very next issue.”\\n\\nHer body asserted itself with a restless movement of her knee, and she\\nstood up.\\n\\nβ€œTen o’clock,” she remarked, apparently finding the time on the\\nceiling. β€œTime for this good girl to go to bed.”\\n\\nβ€œJordan’s going to play in the tournament tomorrow,” explained Daisy,\\nβ€œover at Westchester.”\\n\\nβ€œOhβ€”you’re Jordan Baker.”\\n\\nI knew now why her face was familiarβ€”its pleasing contemptuous\\nexpression had looked out at me from many rotogravure pictures of the\\nsporting life at Asheville and Hot Springs and Palm Beach. I had heard\\nsome story of her too, a critical, unpleasant story, but what it was I\\nhad forgotten long ago.\\n\\nβ€œGood night,” she said softly. β€œWake me at eight, won’t you.”\\n\\nβ€œIf you’ll get up.”\\n\\nβ€œI will. Good night, Mr. Carraway. See you anon.”\\n\\nβ€œOf course you will,” confirmed Daisy. β€œIn fact I think I’ll arrange a\\nmarriage. Come over often, Nick, and I’ll sort ofβ€”ohβ€”fling you\\ntogether. You knowβ€”lock you up accidentally in linen closets and push\\nyou out to sea in a boat, and all that sort of thing—”\\n\\nβ€œGood night,” called Miss Baker from the stairs. β€œI haven’t heard a\\nword.”\\n\\nβ€œShe’s a nice girl,” said Tom after a moment. β€œThey oughtn’t to let\\nher run around the country this way.”\\n\\nβ€œWho oughtn’t to?” inquired Daisy coldly.\\n\\nβ€œHer family.”\\n\\nβ€œHer family is one aunt about a thousand years old. Besides, Nick’s\\ngoing to look after her, aren’t you, Nick? She’s going to spend lots\\nof weekends out here this summer. I think the home influence will be\\nvery good for her.”\\n\\nDaisy and Tom looked at each other for a moment in silence.\\n\\nβ€œIs she from New York?” I asked quickly.\\n\\nβ€œFrom Louisville. Our white girlhood was passed together there. Our\\nbeautiful white—”\\n\\nβ€œDid you give Nick a little heart to heart talk on the veranda?”\\ndemanded Tom suddenly.\\n\\nβ€œDid I?” She looked at me. β€œI can’t seem to remember, but I think we\\ntalked about the Nordic race. Yes, I’m sure we did. It sort of crept\\nup on us and first thing you know—”\\n\\nβ€œDon’t believe everything you hear, Nick,” he advised me.\\n', text_annotated='<narrator>Inside, the crimson room bloomed with light. Tom and Miss Baker sat at either end of the long couch and she read aloud to him from the Saturday Evening Postβ€”the words, murmurous and uninflected, running together in a soothing tune. 
The lamplight, bright on his boots and dull on the autumn-leaf yellow of her hair, glinted along the paper as she turned a page with a flutter of slender muscles in her arms.</narrator>\\n\\n<narrator>When we came in she held us silent for a moment with a lifted hand.</narrator>\\n\\n<Jordan>β€œTo be continued,”</Jordan> <narrator>she said, tossing the magazine on the table,</narrator> <Jordan>β€œin our very next issue.”</Jordan>\\n\\n<narrator>Her body asserted itself with a restless movement of her knee, and she stood up.</narrator>\\n\\n<Jordan>β€œTen o’clock,”</Jordan> <narrator>she remarked, apparently finding the time on the ceiling.</narrator> <Jordan>β€œTime for this good girl to go to bed.”</Jordan>\\n\\n<Daisy>β€œJordan’s going to play in the tournament tomorrow,”</Daisy> <narrator>explained Daisy,</narrator> <Daisy>β€œover at Westchester.”</Daisy>\\n\\n<narrator>β€œOhβ€”you’re Jordan Baker.”</narrator>\\n\\n<narrator>I knew now why her face was familiarβ€”its pleasing contemptuous expression had looked out at me from many rotogravure pictures of the sporting life at Asheville and Hot Springs and Palm Beach. I had heard some story of her too, a critical, unpleasant story, but what it was I had forgotten long ago.</narrator>\\n\\n<Jordan>β€œGood night,”</Jordan> <narrator>she said softly.</narrator> <Jordan>β€œWake me at eight, won’t you.”</Jordan>\\n\\n<Daisy>β€œIf you’ll get up.”</Daisy>\\n\\n<Jordan>β€œI will. Good night, Mr. Carraway. See you anon.”</Jordan>\\n\\n<Daisy>β€œOf course you will,”</Daisy> <narrator>confirmed Daisy.</narrator> <Daisy>β€œIn fact I think I’ll arrange a marriage. Come over often, Nick, and IοΏ½οΏ½οΏ½ll sort ofβ€”ohβ€”fling you together. You knowβ€”lock you up accidentally in linen closets and push you out to sea in a boat, and all that sort of thing—”</Daisy>\\n\\n<Jordan>β€œGood night,”</Jordan> <narrator>called Miss Baker from the stairs.</narrator> <Jordan>β€œI haven’t heard a word.”</Jordan>\\n\\n<Tom>β€œShe’s a nice girl,”</Tom> <narrator>said Tom after a moment.</narrator> <Tom>β€œThey oughtn’t to let her run around the country this way.”</Tom>\\n\\n<Daisy>β€œWho oughtn’t to?”</Daisy> <narrator>inquired Daisy coldly.</narrator>\\n\\n<Tom>β€œHer family.”</Tom>\\n\\n<Daisy>β€œHer family is one aunt about a thousand years old. Besides, Nick’s going to look after her, aren’t you, Nick? She’s going to spend lots of weekends out here this summer. I think the home influence will be very good for her.”</Daisy>\\n\\n<narrator>Daisy and Tom looked at each other for a moment in silence.</narrator>\\n\\n<narrator>β€œIs she from New York?”</narrator> <narrator>I asked quickly.</narrator>\\n\\n<Daisy>β€œFrom Louisville. Our white girlhood was passed together there. Our beautiful white—”</Daisy>\\n\\n<Tom>β€œDid you give Nick a little heart to heart talk on the veranda?”</Tom> <narrator>demanded Tom suddenly.</narrator>\\n\\n<Daisy>β€œDid I?”</Daisy> <narrator>She looked at me.</narrator> <Daisy>β€œI can’t seem to remember, but I think we talked about the Nordic race. Yes, I’m sure we did. It sort of crept up on us and first thing you know—”</Daisy>\\n\\n<Tom>β€œDon’t believe everything you hear, Nick,”</Tom> <narrator>he advised me.</narrator>')"
919
- ]
920
- },
921
- "execution_count": 5,
922
- "metadata": {},
923
- "output_type": "execute_result"
924
- }
925
- ],
926
- "source": [
927
- "res"
928
- ]
929
- },
930
- {
931
- "cell_type": "code",
932
- "execution_count": 6,
933
- "metadata": {},
934
- "outputs": [
935
- {
936
- "data": {
937
- "text/plain": [
938
- "['Tom', 'Jordan', 'Daisy', 'narrator']"
939
- ]
940
- },
941
- "execution_count": 6,
942
- "metadata": {},
943
- "output_type": "execute_result"
944
- }
945
- ],
946
- "source": [
947
- "res.characters"
948
- ]
949
- },
950
- {
951
- "cell_type": "code",
952
- "execution_count": 7,
953
- "metadata": {},
954
- "outputs": [
955
- {
956
- "name": "stdout",
957
- "output_type": "stream",
958
- "text": [
959
- "<narrator>Inside, the crimson room bloomed with light. Tom and Miss Baker sat at either end of the long couch and she read aloud to him from the Saturday Evening Postβ€”the words, murmurous and uninflected, running together in a soothing tune. The lamplight, bright on his boots and dull on the autumn-leaf yellow of her hair, glinted along the paper as she turned a page with a flutter of slender muscles in her arms.</narrator>\n",
960
- "\n",
961
- "<narrator>When we came in she held us silent for a moment with a lifted hand.</narrator>\n",
962
- "\n",
963
- "<Jordan>β€œTo be continued,”</Jordan> <narrator>she said, tossing the magazine on the table,</narrator> <Jordan>β€œin our very next issue.”</Jordan>\n",
964
- "\n",
965
- "<narrator>Her body asserted itself with a restless movement of her knee, and she stood up.</narrator>\n",
966
- "\n",
967
- "<Jordan>β€œTen o’clock,”</Jordan> <narrator>she remarked, apparently finding the time on the ceiling.</narrator> <Jordan>β€œTime for this good girl to go to bed.”</Jordan>\n",
968
- "\n",
969
- "<Daisy>β€œJordan’s going to play in the tournament tomorrow,”</Daisy> <narrator>explained Daisy,</narrator> <Daisy>β€œover at Westchester.”</Daisy>\n",
970
- "\n",
971
- "<narrator>β€œOhβ€”you’re Jordan Baker.”</narrator>\n",
972
- "\n",
973
- "<narrator>I knew now why her face was familiarβ€”its pleasing contemptuous expression had looked out at me from many rotogravure pictures of the sporting life at Asheville and Hot Springs and Palm Beach. I had heard some story of her too, a critical, unpleasant story, but what it was I had forgotten long ago.</narrator>\n",
974
- "\n",
975
- "<Jordan>β€œGood night,”</Jordan> <narrator>she said softly.</narrator> <Jordan>β€œWake me at eight, won’t you.”</Jordan>\n",
976
- "\n",
977
- "<Daisy>β€œIf you’ll get up.”</Daisy>\n",
978
- "\n",
979
- "<Jordan>β€œI will. Good night, Mr. Carraway. See you anon.”</Jordan>\n",
980
- "\n",
981
- "<Daisy>β€œOf course you will,”</Daisy> <narrator>confirmed Daisy.</narrator> <Daisy>β€œIn fact I think I’ll arrange a marriage. Come over often, Nick, and I’ll sort ofβ€”ohβ€”fling you together. You knowβ€”lock you up accidentally in linen closets and push you out to sea in a boat, and all that sort of thing—”</Daisy>\n",
982
- "\n",
983
- "<Jordan>β€œGood night,”</Jordan> <narrator>called Miss Baker from the stairs.</narrator> <Jordan>β€œI haven’t heard a word.”</Jordan>\n",
984
- "\n",
985
- "<Tom>β€œShe’s a nice girl,”</Tom> <narrator>said Tom after a moment.</narrator> <Tom>β€œThey oughtn’t to let her run around the country this way.”</Tom>\n",
986
- "\n",
987
- "<Daisy>β€œWho oughtn’t to?”</Daisy> <narrator>inquired Daisy coldly.</narrator>\n",
988
- "\n",
989
- "<Tom>β€œHer family.”</Tom>\n",
990
- "\n",
991
- "<Daisy>β€œHer family is one aunt about a thousand years old. Besides, Nick’s going to look after her, aren’t you, Nick? She’s going to spend lots of weekends out here this summer. I think the home influence will be very good for her.”</Daisy>\n",
992
- "\n",
993
- "<narrator>Daisy and Tom looked at each other for a moment in silence.</narrator>\n",
994
- "\n",
995
- "<narrator>β€œIs she from New York?”</narrator> <narrator>I asked quickly.</narrator>\n",
996
- "\n",
997
- "<Daisy>β€œFrom Louisville. Our white girlhood was passed together there. Our beautiful white—”</Daisy>\n",
998
- "\n",
999
- "<Tom>β€œDid you give Nick a little heart to heart talk on the veranda?”</Tom> <narrator>demanded Tom suddenly.</narrator>\n",
1000
- "\n",
1001
- "<Daisy>β€œDid I?”</Daisy> <narrator>She looked at me.</narrator> <Daisy>β€œI can’t seem to remember, but I think we talked about the Nordic race. Yes, I’m sure we did. It sort of crept up on us and first thing you know—”</Daisy>\n",
1002
- "\n",
1003
- "<Tom>β€œDon’t believe everything you hear, Nick,”</Tom> <narrator>he advised me.</narrator>\n"
1004
- ]
1005
- }
1006
- ],
1007
- "source": [
1008
- "print(res.text_annotated)"
1009
- ]
1010
- },
1011
- {
1012
- "cell_type": "code",
1013
- "execution_count": 8,
1014
- "metadata": {},
1015
- "outputs": [
1016
- {
1017
- "name": "stdout",
1018
- "output_type": "stream",
1019
- "text": [
1020
- "characters: ['Tom', 'Jordan', 'Daisy', 'narrator']\n",
1021
- "--------------------\n",
1022
- "[narrator] Inside, the crimson room bloomed with light. Tom and Miss Baker sat at either end of the long couch and she read aloud to him from the Saturday Evening Postβ€”the words, murmurous and uninflected, running together in a soothing tune. The lamplight, bright on his boots and dull on the autumn-leaf yellow of her hair, glinted along the paper as she turned a page with a flutter of slender muscles in her arms.\n",
1023
- "[narrator] When we came in she held us silent for a moment with a lifted hand.\n",
1024
- "[Jordan] β€œTo be continued,”\n",
1025
- "[narrator] she said, tossing the magazine on the table,\n",
1026
- "[Jordan] β€œin our very next issue.”\n",
1027
- "[narrator] Her body asserted itself with a restless movement of her knee, and she stood up.\n",
1028
- "[Jordan] β€œTen o’clock,”\n",
1029
- "[narrator] she remarked, apparently finding the time on the ceiling.\n",
1030
- "[Jordan] β€œTime for this good girl to go to bed.”\n",
1031
- "[Daisy] β€œJordan’s going to play in the tournament tomorrow,”\n",
1032
- "[narrator] explained Daisy,\n",
1033
- "[Daisy] β€œover at Westchester.”\n",
1034
- "[narrator] β€œOhβ€”you’re Jordan Baker.”\n",
1035
- "[narrator] I knew now why her face was familiarβ€”its pleasing contemptuous expression had looked out at me from many rotogravure pictures of the sporting life at Asheville and Hot Springs and Palm Beach. I had heard some story of her too, a critical, unpleasant story, but what it was I had forgotten long ago.\n",
1036
- "[Jordan] β€œGood night,”\n",
1037
- "[narrator] she said softly.\n",
1038
- "[Jordan] β€œWake me at eight, won’t you.”\n",
1039
- "[Daisy] β€œIf you’ll get up.”\n",
1040
- "[Jordan] β€œI will. Good night, Mr. Carraway. See you anon.”\n",
1041
- "[Daisy] β€œOf course you will,”\n",
1042
- "[narrator] confirmed Daisy.\n",
1043
- "[Daisy] β€œIn fact I think I’ll arrange a marriage. Come over often, Nick, and I’ll sort ofβ€”ohβ€”fling you together. You knowβ€”lock you up accidentally in linen closets and push you out to sea in a boat, and all that sort of thing—”\n",
1044
- "[Jordan] β€œGood night,”\n",
1045
- "[narrator] called Miss Baker from the stairs.\n",
1046
- "[Jordan] β€œI haven’t heard a word.”\n",
1047
- "[Tom] β€œShe’s a nice girl,”\n",
1048
- "[narrator] said Tom after a moment.\n",
1049
- "[Tom] β€œThey oughtn’t to let her run around the country this way.”\n",
1050
- "[Daisy] β€œWho oughtn’t to?”\n",
1051
- "[narrator] inquired Daisy coldly.\n",
1052
- "[Tom] β€œHer family.”\n",
1053
- "[Daisy] β€œHer family is one aunt about a thousand years old. Besides, Nick’s going to look after her, aren’t you, Nick? She’s going to spend lots of weekends out here this summer. I think the home influence will be very good for her.”\n",
1054
- "[narrator] Daisy and Tom looked at each other for a moment in silence.\n",
1055
- "[narrator] β€œIs she from New York?”\n",
1056
- "[narrator] I asked quickly.\n",
1057
- "[Daisy] β€œFrom Louisville. Our white girlhood was passed together there. Our beautiful white—”\n",
1058
- "[Tom] β€œDid you give Nick a little heart to heart talk on the veranda?”\n",
1059
- "[narrator] demanded Tom suddenly.\n",
1060
- "[Daisy] β€œDid I?”\n",
1061
- "[narrator] She looked at me.\n",
1062
- "[Daisy] β€œI can’t seem to remember, but I think we talked about the Nordic race. Yes, I’m sure we did. It sort of crept up on us and first thing you know—”\n",
1063
- "[Tom] β€œDon’t believe everything you hear, Nick,”\n",
1064
- "[narrator] he advised me.\n"
1065
- ]
1066
- }
1067
- ],
1068
- "source": [
1069
- "print(res.to_pretty_text())"
1070
- ]
1071
- },
1072
- {
1073
- "cell_type": "code",
1074
- "execution_count": 9,
1075
- "metadata": {},
1076
- "outputs": [
1077
- {
1078
- "name": "stdout",
1079
- "output_type": "stream",
1080
- "text": [
1081
- "LLM usage:\n",
1082
- "\n",
1083
- "Tokens Used: 1817\n",
1084
- "\tPrompt Tokens: 877\n",
1085
- "\tCompletion Tokens: 940\n",
1086
- "Successful Requests: 1\n",
1087
- "Total Cost (USD): $0.0115925\n"
1088
- ]
1089
- }
1090
- ],
1091
- "source": [
1092
- "print(f'LLM usage:\\n\\n{cb}')"
1093
- ]
1094
- },
1095
  {
1096
  "cell_type": "code",
1097
  "execution_count": null,
@@ -1099,192 +836,6 @@
1099
  "outputs": [],
1100
  "source": []
1101
  },
1102
- {
1103
- "cell_type": "markdown",
1104
- "metadata": {},
1105
- "source": [
1106
- "## map characters to voices"
1107
- ]
1108
- },
1109
- {
1110
- "cell_type": "code",
1111
- "execution_count": 10,
1112
- "metadata": {},
1113
- "outputs": [],
1114
- "source": [
1115
- "from src.select_voice_chain import create_voice_mapping_chain"
1116
- ]
1117
- },
1118
- {
1119
- "cell_type": "code",
1120
- "execution_count": 11,
1121
- "metadata": {},
1122
- "outputs": [],
1123
- "source": [
1124
- "chain = create_voice_mapping_chain(llm_model=GPTModels.GPT_4_TURBO_2024_04_09)"
1125
- ]
1126
- },
1127
- {
1128
- "cell_type": "code",
1129
- "execution_count": 12,
1130
- "metadata": {},
1131
- "outputs": [
1132
- {
1133
- "data": {
1134
- "text/plain": [
1135
- "ChatPromptTemplate(input_variables=['characters', 'text'], input_types={}, partial_variables={'available_genders': '\"male\", \"female\"', 'available_age_groups': '\"old\", \"middle_aged\", \"young\"', 'format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\\n\\nHere is the output schema:\\n```\\n{\"$defs\": {\"CharacterProperties\": {\"properties\": {\"gender\": {\"title\": \"Gender\", \"type\": \"string\"}, \"age_group\": {\"title\": \"Age Group\", \"type\": \"string\"}}, \"required\": [\"gender\", \"age_group\"], \"title\": \"CharacterProperties\", \"type\": \"object\"}}, \"properties\": {\"character2props\": {\"additionalProperties\": {\"$ref\": \"#/$defs/CharacterProperties\"}, \"title\": \"Character2Props\", \"type\": \"object\"}}, \"required\": [\"character2props\"]}\\n```'}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['available_age_groups', 'available_genders', 'format_instructions'], input_types={}, partial_variables={}, template='You are a helpful assistant proficient in literature and psychology.\\nOur goal is to create an audio book from the given text.\\nFor that we need to hire voice actors.\\nPlease help us to find the right actor for each character present in the text.\\n\\nYou are provided with the text split by the characters\\nto whom text parts belong to.\\n\\nYour task is to assign available properties to each character provided.\\nList of available properties:\\n- gender: {available_genders}\\n- age_group: {available_age_groups}\\n\\nNOTES:\\n- assign EXACTLY ONE property value for each property\\n- select properties values ONLY from the list of AVAILABLE property values\\n- fill properties for ALL characters from the list provided\\n- DO NOT include any characters absent in the list provided\\n\\n{format_instructions}\\n'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['characters', 'text'], input_types={}, partial_variables={}, template='<text>\\n{text}\\n</text>\\n\\n<characters>\\n{characters}\\n</characters>\\n'), additional_kwargs={})])\n",
1136
- "| RunnableBinding(bound=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x174a82d80>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x174a812e0>, root_client=<openai.OpenAI object at 0x174a82d50>, root_async_client=<openai.AsyncOpenAI object at 0x174a81730>, model_name='gpt-4-turbo-2024-04-09', temperature=0.0, model_kwargs={}, openai_api_key=SecretStr('**********'), request_timeout=Timeout(connect=4, read=60, write=60, pool=60)), kwargs={'response_format': {'type': 'json_object'}}, config={}, config_factories=[])\n",
1137
- "| PydanticOutputParser(pydantic_object=<class 'src.select_voice_chain.AllCharactersProperties'>)"
1138
- ]
1139
- },
1140
- "execution_count": 12,
1141
- "metadata": {},
1142
- "output_type": "execute_result"
1143
- }
1144
- ],
1145
- "source": [
1146
- "chain"
1147
- ]
1148
- },
1149
- {
1150
- "cell_type": "code",
1151
- "execution_count": 14,
1152
- "metadata": {},
1153
- "outputs": [
1154
- {
1155
- "name": "stderr",
1156
- "output_type": "stream",
1157
- "text": [
1158
- "2024-10-10 02:37:46,347 [INFO] audio-books (lc_callbacks.py): call to gpt-4-turbo-2024-04-09 with 2 messages:\n",
1159
- "{'role': 'system', 'content': 'You are a helpful assistant proficient in literature and psychology.\\nOur goal is to create an audio book from the given text.\\nFor that we need to hire voice actors.\\nPlease help us to find the right actor for each character present in the text.\\n\\nYou are provided with the text split by the characters\\nto whom text parts belong to.\\n\\nYour task is to assign available properties to each character provided.\\nList of available properties:\\n- gender: \"male\", \"female\"\\n- age_group: \"old\", \"middle_aged\", \"young\"\\n\\nNOTES:\\n- assign EXACTLY ONE property value for each property\\n- select properties values ONLY from the list of AVAILABLE property values\\n- fill properties for ALL characters from the list provided\\n- DO NOT include any characters absent in the list provided\\n\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\\n\\nHere is the output schema:\\n```\\n{\"$defs\": {\"CharacterProperties\": {\"properties\": {\"gender\": {\"title\": \"Gender\", \"type\": \"string\"}, \"age_group\": {\"title\": \"Age Group\", \"type\": \"string\"}}, \"required\": [\"gender\", \"age_group\"], \"title\": \"CharacterProperties\", \"type\": \"object\"}}, \"properties\": {\"character2props\": {\"additionalProperties\": {\"$ref\": \"#/$defs/CharacterProperties\"}, \"title\": \"Character2Props\", \"type\": \"object\"}}, \"required\": [\"character2props\"]}\\n```\\n'}\n",
1160
- "{'role': 'human', 'content': \"<text>\\n<narrator>Inside, the crimson room bloomed with light. Tom and Miss Baker sat at either end of the long couch and she read aloud to him from the Saturday Evening Postβ€”the words, murmurous and uninflected, running together in a soothing tune. The lamplight, bright on his boots and dull on the autumn-leaf yellow of her hair, glinted along the paper as she turned a page with a flutter of slender muscles in her arms.</narrator>\\n\\n<narrator>When we came in she held us silent for a moment with a lifted hand.</narrator>\\n\\n<Jordan>β€œTo be continued,”</Jordan> <narrator>she said, tossing the magazine on the table,</narrator> <Jordan>β€œin our very next issue.”</Jordan>\\n\\n<narrator>Her body asserted itself with a restless movement of her knee, and she stood up.</narrator>\\n\\n<Jordan>β€œTen o’clock,”</Jordan> <narrator>she remarked, apparently finding the time on the ceiling.</narrator> <Jordan>β€œTime for this good girl to go to bed.”</Jordan>\\n\\n<Daisy>β€œJordan’s going to play in the tournament tomorrow,”</Daisy> <narrator>explained Daisy,</narrator> <Daisy>β€œover at Westchester.”</Daisy>\\n\\n<narrator>β€œOhβ€”you’re Jordan Baker.”</narrator>\\n\\n<narrator>I knew now why her face was familiarβ€”its pleasing contemptuous expression had looked out at me from many rotogravure pictures of the sporting life at Asheville and Hot Springs and Palm Beach. I had heard some story of her too, a critical, unpleasant story, but what it was I had forgotten long ago.</narrator>\\n\\n<Jordan>β€œGood night,”</Jordan> <narrator>she said softly.</narrator> <Jordan>β€œWake me at eight, won’t you.”</Jordan>\\n\\n<Daisy>β€œIf you’ll get up.”</Daisy>\\n\\n<Jordan>β€œI will. Good night, Mr. Carraway. See you anon.”</Jordan>\\n\\n<Daisy>β€œOf course you will,”</Daisy> <narrator>confirmed Daisy.</narrator> <Daisy>β€œIn fact I think I’ll arrange a marriage. Come over often, Nick, and I’ll sort ofβ€”ohβ€”fling you together. You knowβ€”lock you up accidentally in linen closets and push you out to sea in a boat, and all that sort of thing—”</Daisy>\\n\\n<Jordan>β€œGood night,”</Jordan> <narrator>called Miss Baker from the stairs.</narrator> <Jordan>β€œI haven’t heard a word.”</Jordan>\\n\\n<Tom>β€œShe’s a nice girl,”</Tom> <narrator>said Tom after a moment.</narrator> <Tom>β€œThey oughtn’t to let her run around the country this way.”</Tom>\\n\\n<Daisy>β€œWho oughtn’t to?”</Daisy> <narrator>inquired Daisy coldly.</narrator>\\n\\n<Tom>β€œHer family.”</Tom>\\n\\n<Daisy>β€œHer family is one aunt about a thousand years old. Besides, Nick’s going to look after her, aren’t you, Nick? She’s going to spend lots of weekends out here this summer. I think the home influence will be very good for her.”</Daisy>\\n\\n<narrator>Daisy and Tom looked at each other for a moment in silence.</narrator>\\n\\n<narrator>β€œIs she from New York?”</narrator> <narrator>I asked quickly.</narrator>\\n\\n<Daisy>β€œFrom Louisville. Our white girlhood was passed together there. Our beautiful white—”</Daisy>\\n\\n<Tom>β€œDid you give Nick a little heart to heart talk on the veranda?”</Tom> <narrator>demanded Tom suddenly.</narrator>\\n\\n<Daisy>β€œDid I?”</Daisy> <narrator>She looked at me.</narrator> <Daisy>β€œI can’t seem to remember, but I think we talked about the Nordic race. Yes, I’m sure we did. 
It sort of crept up on us and first thing you know—”</Daisy>\\n\\n<Tom>β€œDon’t believe everything you hear, Nick,”</Tom> <narrator>he advised me.</narrator>\\n</text>\\n\\n<characters>\\n['Tom', 'Jordan', 'Daisy', 'narrator']\\n</characters>\\n\"}\n",
1161
- "2024-10-10 02:37:52,060 [INFO] httpx (_client.py): HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
1162
- "2024-10-10 02:37:52,063 [INFO] audio-books (lc_callbacks.py): raw LLM response: \"{\n",
1163
- " \"character2props\": {\n",
1164
- " \"Tom\": {\n",
1165
- " \"gender\": \"male\",\n",
1166
- " \"age_group\": \"middle_aged\"\n",
1167
- " },\n",
1168
- " \"Jordan\": {\n",
1169
- " \"gender\": \"female\",\n",
1170
- " \"age_group\": \"young\"\n",
1171
- " },\n",
1172
- " \"Daisy\": {\n",
1173
- " \"gender\": \"female\",\n",
1174
- " \"age_group\": \"young\"\n",
1175
- " },\n",
1176
- " \"narrator\": {\n",
1177
- " \"gender\": \"male\",\n",
1178
- " \"age_group\": \"middle_aged\"\n",
1179
- " }\n",
1180
- " }\n",
1181
- "}\"\n"
1182
- ]
1183
- }
1184
- ],
1185
- "source": [
1186
- "res2 = chain.invoke(\n",
1187
- " {\"text\": res.text_annotated, \"characters\": res.characters},\n",
1188
- " config={\"callbacks\": [LCMessageLoggerAsync()]},\n",
1189
- ")"
1190
- ]
1191
- },
1192
- {
1193
- "cell_type": "code",
1194
- "execution_count": 15,
1195
- "metadata": {},
1196
- "outputs": [
1197
- {
1198
- "data": {
1199
- "text/plain": [
1200
- "AllCharactersProperties(character2props={'Tom': CharacterProperties(gender='male', age_group='middle_aged'), 'Jordan': CharacterProperties(gender='female', age_group='young'), 'Daisy': CharacterProperties(gender='female', age_group='young'), 'narrator': CharacterProperties(gender='male', age_group='middle_aged')})"
1201
- ]
1202
- },
1203
- "execution_count": 15,
1204
- "metadata": {},
1205
- "output_type": "execute_result"
1206
- }
1207
- ],
1208
- "source": [
1209
- "res2"
1210
- ]
1211
- },
1212
- {
1213
- "cell_type": "code",
1214
- "execution_count": null,
1215
- "metadata": {},
1216
- "outputs": [
1217
- {
1218
- "name": "stdout",
1219
- "output_type": "stream",
1220
- "text": [
1221
- "<class 'pandas.core.frame.DataFrame'>\n",
1222
- "RangeIndex: 22 entries, 0 to 21\n",
1223
- "Data columns (total 14 columns):\n",
1224
- " # Column Non-Null Count Dtype \n",
1225
- "--- ------ -------------- ----- \n",
1226
- " 0 voice_id 22 non-null object \n",
1227
- " 1 name 22 non-null object \n",
1228
- " 2 preview_url 22 non-null object \n",
1229
- " 3 owner_id 0 non-null float64\n",
1230
- " 4 permission_on_resource 2 non-null object \n",
1231
- " 5 is_legacy 22 non-null bool \n",
1232
- " 6 is_mixed 22 non-null bool \n",
1233
- " 7 accent 22 non-null object \n",
1234
- " 8 description 20 non-null object \n",
1235
- " 9 age 22 non-null object \n",
1236
- " 10 gender 22 non-null object \n",
1237
- " 11 category 22 non-null object \n",
1238
- " 12 language 2 non-null object \n",
1239
- " 13 descriptive 2 non-null object \n",
1240
- "dtypes: bool(2), float64(1), object(11)\n",
1241
- "memory usage: 2.2+ KB\n"
1242
- ]
1243
- }
1244
- ],
1245
- "source": [
1246
- "voices = pd.read_csv(\"11labs_available_tts_voices.csv\")\n",
1247
- "voices.info()"
1248
- ]
1249
- },
1250
- {
1251
- "cell_type": "code",
1252
- "execution_count": null,
1253
- "metadata": {},
1254
- "outputs": [
1255
- {
1256
- "data": {
1257
- "text/plain": [
1258
- "array(['middle_aged', 'young', 'old'], dtype=object)"
1259
- ]
1260
- },
1261
- "metadata": {},
1262
- "output_type": "display_data"
1263
- }
1264
- ],
1265
- "source": [
1266
- "voices[\"age\"].unique()"
1267
- ]
1268
- },
1269
- {
1270
- "cell_type": "code",
1271
- "execution_count": null,
1272
- "metadata": {},
1273
- "outputs": [
1274
- {
1275
- "data": {
1276
- "text/plain": [
1277
- "array(['female', 'male', 'non-binary', 'neutral'], dtype=object)"
1278
- ]
1279
- },
1280
- "metadata": {},
1281
- "output_type": "display_data"
1282
- }
1283
- ],
1284
- "source": [
1285
- "voices[\"gender\"].unique()"
1286
- ]
1287
- },
1288
  {
1289
  "cell_type": "code",
1290
  "execution_count": null,
 
1
  {
2
  "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "## initialize"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": null,
13
+ "metadata": {},
14
+ "outputs": [],
15
+ "source": [
16
+ "%cd .."
17
+ ]
18
+ },
19
  {
20
  "cell_type": "code",
21
  "execution_count": 1,
 
28
  },
29
  {
30
  "cell_type": "code",
31
+ "execution_count": 12,
32
  "metadata": {},
33
  "outputs": [],
34
  "source": [
 
 
35
  "import dotenv\n",
36
+ "import pandas as pd"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
37
  ]
38
  },
39
  {
 
829
  "outputs": [],
830
  "source": []
831
  },
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
832
  {
833
  "cell_type": "code",
834
  "execution_count": null,
 
836
  "outputs": [],
837
  "source": []
838
  },
 
 
 
 
839
  {
840
  "cell_type": "code",
841
  "execution_count": null,
filter_voices.ipynb β†’ notebooks/filter_voices.ipynb RENAMED
@@ -1,5 +1,14 @@
1
  {
2
  "cells": [
 
 
3
  {
4
  "cell_type": "code",
5
  "execution_count": 1,
 
1
  {
2
  "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "%cd .."
10
+ ]
11
+ },
12
  {
13
  "cell_type": "code",
14
  "execution_count": 1,
notebooks/playground.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
pyproject.toml ADDED
@@ -0,0 +1,8 @@
 
 
1
+ [tool.black]
2
+ line-length = 100
3
+ target-version = ['py311']
4
+ skip-string-normalization = true
5
+
6
+ [tool.isort]
7
+ profile = "black"
8
+ line_length = 100
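The formatter configuration above pairs with black and isort being added to requirements.txt in this same commit. A small helper, purely illustrative and not part of the repo, that applies both tools from the repo root under this configuration:

```python
# Hypothetical convenience script (not in the commit): run isort then black so the
# pyproject.toml settings above (black profile, line length 100) are applied.
import subprocess


def format_repo() -> None:
    subprocess.run(["isort", "."], check=True)
    subprocess.run(["black", "."], check=True)


if __name__ == "__main__":
    format_repo()
```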
requirements.txt CHANGED
@@ -9,4 +9,6 @@ elevenlabs
9
  gradio
10
  python-dotenv
11
  streamlit
12
- pypdf
 
 
 
9
  gradio
10
  python-dotenv
11
  streamlit
12
+ pypdf
13
+ black
14
+ isort
scripts/add_voices.py CHANGED
@@ -7,7 +7,6 @@ from elevenlabs import ElevenLabs
7
  from elevenlabs.core import ApiError
8
  from tqdm.auto import tqdm
9
 
10
-
11
  logging.basicConfig(
12
  level=logging.INFO,
13
  format="%(asctime)s [%(levelname)s] %(name)s (%(filename)s): %(message)s",
@@ -23,9 +22,7 @@ load_dotenv()
23
  @click.option("-i", "--input-csv-path", default="data/11labs_tts_voices.csv")
24
  def main(*, api_key: str | None, input_csv_path: str) -> None:
25
  if api_key is None:
26
- raise OSError(
27
- "Who's gonna set the `ELEVEN_LABS_API_KEY` environmental variable?"
28
- )
29
 
30
  client = ElevenLabs(api_key=api_key)
31
  voices_to_import = pd.read_csv(input_csv_path)
@@ -39,13 +36,11 @@ def main(*, api_key: str | None, input_csv_path: str) -> None:
39
  )
40
  except ApiError:
41
  logger.error(
42
- f"Shared voice with `{public_user_id = }`, `{voice_id = }` "
43
- "already added."
44
  )
45
  else:
46
  logger.info(
47
- f"Added shared voice with `{public_user_id = }`, `{voice_id = }`, "
48
- f"`{name = }`."
49
  )
50
 
51
 
 
7
  from elevenlabs.core import ApiError
8
  from tqdm.auto import tqdm
9
 
 
10
  logging.basicConfig(
11
  level=logging.INFO,
12
  format="%(asctime)s [%(levelname)s] %(name)s (%(filename)s): %(message)s",
 
22
  @click.option("-i", "--input-csv-path", default="data/11labs_tts_voices.csv")
23
  def main(*, api_key: str | None, input_csv_path: str) -> None:
24
  if api_key is None:
25
+ raise OSError("Who's gonna set the `ELEVEN_LABS_API_KEY` environmental variable?")
 
 
26
 
27
  client = ElevenLabs(api_key=api_key)
28
  voices_to_import = pd.read_csv(input_csv_path)
 
36
  )
37
  except ApiError:
38
  logger.error(
39
+ f"Shared voice with `{public_user_id = }`, `{voice_id = }` " "already added."
 
40
  )
41
  else:
42
  logger.info(
43
+ f"Added shared voice with `{public_user_id = }`, `{voice_id = }`, " f"`{name = }`."
 
44
  )
45
 
46
 
scripts/export_available_voices.py CHANGED
@@ -6,7 +6,6 @@ import pandas as pd
6
  from dotenv import load_dotenv
7
  from elevenlabs import ElevenLabs
8
 
9
-
10
  logging.basicConfig(
11
  level=logging.INFO,
12
  format="%(asctime)s [%(levelname)s] %(name)s (%(filename)s): %(message)s",
@@ -22,9 +21,7 @@ load_dotenv()
22
  @click.option("-o", "--output-csv-path", default="data/11labs_available_tts_voices.csv")
23
  def main(*, api_key: str | None, output_csv_path: str) -> None:
24
  if api_key is None:
25
- raise OSError(
26
- "Who's gonna set the `ELEVEN_LABS_API_KEY` environmental variable?"
27
- )
28
 
29
  client = ElevenLabs(api_key=api_key)
30
  response = client.voices.get_all()
 
6
  from dotenv import load_dotenv
7
  from elevenlabs import ElevenLabs
8
 
 
9
  logging.basicConfig(
10
  level=logging.INFO,
11
  format="%(asctime)s [%(levelname)s] %(name)s (%(filename)s): %(message)s",
 
21
  @click.option("-o", "--output-csv-path", default="data/11labs_available_tts_voices.csv")
22
  def main(*, api_key: str | None, output_csv_path: str) -> None:
23
  if api_key is None:
24
+ raise OSError("Who's gonna set the `ELEVEN_LABS_API_KEY` environmental variable?")
 
 
25
 
26
  client = ElevenLabs(api_key=api_key)
27
  response = client.voices.get_all()
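The remainder of the script is unchanged and truncated in this diff; it turns that `get_all()` response into the `data/11labs_available_tts_voices.csv` file the notebook inspects above (voice_id, name, age, gender, accent, ...). A rough sketch of that flattening step; the attribute names and the use of `dict()` are assumptions, not the script's exact code:

```python
# Sketch only: flatten the ElevenLabs voices listing into a DataFrame and save it as CSV.
import pandas as pd


def voices_to_csv(response, output_csv_path: str) -> None:
    rows = []
    for voice in response.voices:
        data = voice.dict() if hasattr(voice, "dict") else vars(voice)  # assumption: pydantic-style model
        labels = data.pop("labels", None) or {}  # age, gender, accent, etc. assumed to live in labels
        rows.append({**data, **labels})
    pd.DataFrame(rows).to_csv(output_csv_path, index=False)
```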
src/audio_generators.py DELETED
@@ -1,315 +0,0 @@
1
- import asyncio
2
- import os
3
- import re
4
- from pathlib import Path
5
- from uuid import uuid4
6
- import random
7
-
8
- from langchain_community.callbacks import get_openai_callback
9
- from pydub import AudioSegment
10
-
11
- from src.lc_callbacks import LCMessageLoggerAsync
12
- from src.tts import tts_astream_consumed, sound_generation_consumed
13
- from src.utils import consume_aiter
14
- from src.emotions.generation import (
15
- EffectGeneratorAsync,
16
- TextPreparationForTTSTaskOutput,
17
- )
18
- from src.emotions.utils import add_overlay_for_audio
19
- from src.config import ELEVENLABS_MAX_PARALLEL, logger, OPENAI_MAX_PARALLEL
20
- from src.text_split_chain import SplitTextOutput
21
-
22
-
23
- class AudioGeneratorSimple:
24
-
25
- async def generate_audio(
26
- self,
27
- text_split: SplitTextOutput,
28
- character_to_voice: dict[str, str],
29
- ) -> Path:
30
- semaphore = asyncio.Semaphore(ELEVENLABS_MAX_PARALLEL)
31
-
32
- async def tts_astream_with_semaphore(voice_id: str, text: str):
33
- async with semaphore:
34
- bytes_ = await tts_astream_consumed(voice_id=voice_id, text=text)
35
- # bytes_ = await consume_aiter(iter_)
36
- return bytes_
37
-
38
- tasks = []
39
- for character_phrase in text_split.phrases:
40
- voice_id = character_to_voice[character_phrase.character]
41
- task = tts_astream_with_semaphore(
42
- voice_id=voice_id, text=character_phrase.text
43
- )
44
- tasks.append(task)
45
-
46
- results = await asyncio.gather(*tasks)
47
-
48
- save_dir = Path("data") / "books"
49
- save_dir.mkdir(exist_ok=True)
50
- audio_combined_fp = save_dir / f"{uuid4()}.wav"
51
-
52
- logger.info(f'saving generated audio book to: "{audio_combined_fp}"')
53
- with open(audio_combined_fp, "wb") as ab:
54
- for result in results:
55
- for chunk in result:
56
- ab.write(chunk)
57
-
58
- return audio_combined_fp
59
-
60
-
61
- class AudioGeneratorWithEffects:
62
-
63
- def __init__(self):
64
- self.effect_generator = EffectGeneratorAsync(predict_duration=True)
65
- self.semaphore = asyncio.Semaphore(ELEVENLABS_MAX_PARALLEL)
66
- self.temp_files = []
67
-
68
- async def generate_audio(
69
- self,
70
- text_split: SplitTextOutput,
71
- character_to_voice: dict[str, str],
72
- out_path: Path | None = None,
73
- *,
74
- generate_effects: bool = True,
75
- ) -> Path:
76
- """Main method to generate the audiobook with TTS, emotion, and sound effects."""
77
- num_lines = len(text_split.phrases)
78
- lines_for_sound_effect = self._select_lines_for_sound_effect(
79
- num_lines,
80
- fraction=float(0.2 * generate_effects),
81
- )
82
- logger.info(f"{generate_effects = }, {lines_for_sound_effect = }")
83
-
84
- data_for_tts, data_for_sound_effects = await self._prepare_text_for_tts(
85
- text_split, lines_for_sound_effect
86
- )
87
-
88
- tts_results, self.temp_files = await self._generate_tts_audio(
89
- text_split, data_for_tts, character_to_voice
90
- )
91
-
92
- audio_chunks = await self._add_sound_effects(
93
- tts_results, lines_for_sound_effect, data_for_sound_effects, self.temp_files
94
- )
95
-
96
- normalized_audio_chunks = self._normalize_audio_chunks(
97
- audio_chunks, self.temp_files
98
- )
99
-
100
- final_output = self._merge_audio_files(
101
- normalized_audio_chunks, save_path=out_path
102
- )
103
-
104
- self._cleanup_temp_files(self.temp_files)
105
-
106
- return final_output
107
-
108
- def _select_lines_for_sound_effect(
109
- self, num_lines: int, fraction: float
110
- ) -> list[int]:
111
- """Select % of the lines randomly for sound effect generation."""
112
- return random.sample(range(num_lines), k=int(fraction * num_lines))
113
-
114
- async def _prepare_text_for_tts(
115
- self, text_split: SplitTextOutput, lines_for_sound_effect: list[int]
116
- ) -> tuple[list[dict], list[dict]]:
117
- semaphore = asyncio.Semaphore(OPENAI_MAX_PARALLEL)
118
-
119
- async def run_task_with_semaphore(func, **params):
120
- async with semaphore:
121
- outputs = await func(**params)
122
- return outputs
123
-
124
- task_emotion_code = "add_emotion"
125
- task_effects_code = "add_effects"
126
-
127
- tasks = []
128
-
129
- for idx, character_phrase in enumerate(text_split.phrases):
130
- character_text = character_phrase.text.strip().lower()
131
-
132
- tasks.append(
133
- run_task_with_semaphore(
134
- func=self.effect_generator.add_emotion_to_text,
135
- text=character_text,
136
- )
137
- )
138
-
139
- # If this line needs sound effects, generate parameters
140
- if idx in lines_for_sound_effect:
141
- tasks.append(
142
- run_task_with_semaphore(
143
- func=self.effect_generator.generate_parameters_for_sound_effect,
144
- text=character_text,
145
- )
146
- )
147
-
148
- tasks_results: list[TextPreparationForTTSTaskOutput] = []
149
- tasks_results = await asyncio.gather(*tasks)
150
-
151
- emotion_tasks_results = [
152
- x.output for x in tasks_results if x.task == task_emotion_code
153
- ]
154
- effects_tasks_results = [
155
- x.output for x in tasks_results if x.task == task_effects_code
156
- ]
157
-
158
- return emotion_tasks_results, effects_tasks_results
159
-
160
- async def _generate_tts_audio(
161
- self,
162
- text_split: SplitTextOutput,
163
- data_for_tts: list[dict],
164
- character_to_voice: dict[str, str],
165
- ) -> tuple[list[str], list[str]]:
166
- """Generate TTS audio for modified text."""
167
- tasks_for_tts = []
168
- temp_files = []
169
-
170
- async def tts_astream_with_semaphore(voice_id: str, text: str, params: dict):
171
- async with self.semaphore:
172
- bytes_ = await tts_astream_consumed(
173
- voice_id=voice_id, text=text, params=params
174
- )
175
- # bytes_ = await consume_aiter(iter_)
176
- return bytes_
177
-
178
- for idx, (data_item, character_phrase) in enumerate(
179
- zip(data_for_tts, text_split.phrases)
180
- ):
181
- voice_id = character_to_voice[character_phrase.character]
182
-
183
- task = tts_astream_with_semaphore(
184
- voice_id=voice_id,
185
- text=data_item["modified_text"],
186
- params=data_item["params"],
187
- )
188
- tasks_for_tts.append(task)
189
-
190
- tts_results = await asyncio.gather(*tasks_for_tts)
191
-
192
- # Save the results to temporary files
193
- tts_audio_files = []
194
- for idx, tts_result in enumerate(tts_results):
195
- tts_filename = f"tts_output_{idx}.wav"
196
- with open(tts_filename, "wb") as ab:
197
- for chunk in tts_result:
198
- ab.write(chunk)
199
- tts_audio_files.append(tts_filename)
200
- temp_files.append(tts_filename)
201
-
202
- return tts_audio_files, temp_files
203
-
204
- async def _add_sound_effects(
205
- self,
206
- tts_audio_files: list[str],
207
- lines_for_sound_effect: list[int],
208
- data_for_sound_effects: list[dict],
209
- temp_files: list[str],
210
- ) -> list[str]:
211
- """Add sound effects to the selected lines."""
212
-
213
- semaphore = asyncio.Semaphore(ELEVENLABS_MAX_PARALLEL)
214
-
215
- async def _process_single_phrase(
216
- tts_filename: str,
217
- sound_effect_data: dict | None,
218
- sound_effect_filename: str,
219
- ):
220
- if sound_effect_data is None:
221
- return (tts_filename, [])
222
-
223
- async with semaphore:
224
- sound_result = await sound_generation_consumed(sound_effect_data)
225
-
226
- # save to file
227
- with open(sound_effect_filename, "wb") as ab:
228
- for chunk in sound_result:
229
- ab.write(chunk)
230
-
231
- # overlay sound effect on TTS audio
232
- tts_with_effects_filename = add_overlay_for_audio(
233
- main_audio_filename=tts_filename,
234
- sound_effect_filename=sound_effect_filename,
235
- cycling_effect=True,
236
- decrease_effect_volume=5,
237
- )
238
- tmp_files = [sound_effect_filename, tts_with_effects_filename]
239
- return (tts_with_effects_filename, tmp_files)
240
-
241
- tasks = []
242
- for idx, tts_filename in enumerate(tts_audio_files):
243
- sound_effect_filename = f"sound_effect_{idx}.wav"
244
-
245
- if idx not in lines_for_sound_effect:
246
- tasks.append(
247
- _process_single_phrase(
248
- tts_filename=tts_filename,
249
- sound_effect_data=None,
250
- sound_effect_filename=sound_effect_filename,
251
- )
252
- )
253
- else:
254
- sound_effect_data = data_for_sound_effects.pop(0)
255
- tasks.append(
256
- _process_single_phrase(
257
- tts_filename=tts_filename,
258
- sound_effect_data=sound_effect_data,
259
- sound_effect_filename=sound_effect_filename,
260
- )
261
- )
262
-
263
- outputs = await asyncio.gather(*tasks)
264
- audio_chunks = [x[0] for x in outputs]
265
- tmp_files_to_add = [item for x in outputs for item in x[1]]
266
- temp_files.extend(tmp_files_to_add)
267
-
268
- return audio_chunks
269
-
270
- def _normalize_audio(
271
- self, audio_segment: AudioSegment, target_dBFS: float = -20.0
272
- ) -> AudioSegment:
273
- """Normalize an audio segment to the target dBFS level."""
274
- change_in_dBFS = target_dBFS - audio_segment.dBFS
275
- return audio_segment.apply_gain(change_in_dBFS)
276
-
277
- def _normalize_audio_chunks(
278
- self, audio_filenames: list[str], temp_files, target_dBFS: float = -20.0
279
- ) -> list[str]:
280
- """Normalize all audio chunks to the target volume level."""
281
- normalized_files = []
282
- for audio_file in audio_filenames:
283
- audio_segment = AudioSegment.from_file(audio_file)
284
- normalized_audio = self._normalize_audio(audio_segment, target_dBFS)
285
-
286
- normalized_filename = f"normalized_{Path(audio_file).stem}.wav"
287
- normalized_audio.export(normalized_filename, format="wav")
288
- normalized_files.append(normalized_filename)
289
- temp_files.append(normalized_filename)
290
-
291
- return normalized_files
292
-
293
- def _merge_audio_files(
294
- self, audio_filenames: list[str], save_path: Path | None = None
295
- ) -> Path:
296
- """Helper function to merge multiple audio files into one."""
297
- combined = AudioSegment.from_file(audio_filenames[0])
298
- for filename in audio_filenames[1:]:
299
- next_audio = AudioSegment.from_file(filename)
300
- combined += next_audio # Concatenate the audio
301
-
302
- if save_path is None:
303
- save_dir = Path("data") / "books"
304
- save_dir.mkdir(exist_ok=True)
305
- save_path = save_dir / f"{uuid4()}.wav"
306
- combined.export(save_path, format="wav")
307
- return Path(save_path)
308
-
309
- def _cleanup_temp_files(self, temp_files: list[str]) -> None:
310
- """Helper function to delete all temporary files."""
311
- for temp_file in temp_files:
312
- try:
313
- os.remove(temp_file)
314
- except FileNotFoundError:
315
- continue
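One detail of this deleted module carries over into the new pipeline: the loudness normalization from `_normalize_audio` reappears in `src/builder.py` as `utils.normalize_audio`. A minimal standalone equivalent, assuming pydub as above:

```python
# Apply uniform gain so the clip's average level reaches the target dBFS
# (the new builder uses -20 dBFS for speech and -27 dBFS for sound effects).
from pydub import AudioSegment


def normalize_audio(segment: AudioSegment, target_dBFS: float = -20.0) -> AudioSegment:
    return segment.apply_gain(target_dBFS - segment.dBFS)
```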
 
 
 
 
src/builder.py CHANGED
@@ -1,32 +1,100 @@
 
 
1
  from langchain_community.callbacks import get_openai_callback
 
 
2
 
3
- from src.audio_generators import AudioGeneratorWithEffects
 
 
 
 
 
 
4
  from src.lc_callbacks import LCMessageLoggerAsync
5
- from src.select_voice_chain import SelectVoiceChainOutput, VoiceSelector
 
 
 
 
6
  from src.text_split_chain import SplitTextOutput, create_split_text_chain
7
- from src.utils import GPTModels
 
 
 
 
 
 
 
8
 
9
 
10
- class AudiobookBuilder:
 
 
11
 
12
- def __init__(self):
 
 
13
  self.voice_selector = VoiceSelector()
14
- self.audio_generator = AudioGeneratorWithEffects()
 
 
 
 
 
15
 
16
- async def split_text(self, text: str) -> SplitTextOutput:
 
 
 
 
17
  chain = create_split_text_chain(llm_model=GPTModels.GPT_4o)
18
  with get_openai_callback() as cb:
19
  chain_out = await chain.ainvoke(
20
  {"text": text}, config={"callbacks": [LCMessageLoggerAsync()]}
21
  )
 
22
  return chain_out
23
 
24
- async def map_characters_to_voices(
 
 
 
 
25
  self, text_split: SplitTextOutput
26
  ) -> SelectVoiceChainOutput:
27
- chain = self.voice_selector.create_voice_mapping_chain(
28
- llm_model=GPTModels.GPT_4o
29
- )
30
  with get_openai_callback() as cb:
31
  chain_out = await chain.ainvoke(
32
  {
@@ -35,17 +103,495 @@ class AudiobookBuilder:
35
  },
36
  config={"callbacks": [LCMessageLoggerAsync()]},
37
  )
 
38
  return chain_out
39
 
40
- async def run(self, text: str, *, generate_effects: bool):
41
- text_split = await self.split_text(text)
42
- select_voice_chain_out = await self.map_characters_to_voices(
43
- text_split=text_split
 
 
 
 
 
 
44
  )
45
- # TODO: show select_voice_chain_out.character2props on UI
46
- out_path = await self.audio_generator.generate_audio(
47
- text_split=text_split,
48
- character_to_voice=select_voice_chain_out.character2voice,
49
- generate_effects=generate_effects,
 
50
  )
51
- return out_path
 
 
 
 
1
+ import asyncio
2
+ import os
3
+ from asyncio import TaskGroup
4
+ from pathlib import Path
5
+ from typing import Any, Callable, List
6
+ from uuid import uuid4
7
+
8
  from langchain_community.callbacks import get_openai_callback
9
+ from pydantic import BaseModel
10
+ from pydub import AudioSegment
11
 
12
+ from src import tts, utils
13
+ from src.config import (
14
+ CONTEXT_CHAR_LEN_FOR_TTS,
15
+ ELEVENLABS_MAX_PARALLEL,
16
+ OPENAI_MAX_PARALLEL,
17
+ logger,
18
+ )
19
  from src.lc_callbacks import LCMessageLoggerAsync
20
+ from src.preprocess_tts_emotions_chain import TTSParamProcessor
21
+ from src.schemas import SoundEffectsParams, TTSParams, TTSTimestampsAlignment, TTSTimestampsResponse
22
+ from src.select_voice_chain import (
23
+ CharacterPropertiesNullable,
24
+ SelectVoiceChainOutput,
25
+ VoiceSelector,
26
+ )
27
+ from src.sound_effects_design import (
28
+ SoundEffectDescription,
29
+ SoundEffectsDesignOutput,
30
+ create_sound_effects_design_chain,
31
+ )
32
+ from src.text_modification_chain import modify_text_chain
33
  from src.text_split_chain import SplitTextOutput, create_split_text_chain
34
+ from src.utils import GPTModels, prettify_unknown_character_label
35
+ from src.web.constructor import HTMLGenerator
36
+ from src.web.utils import (
37
+ create_status_html,
38
+ generate_text_split_inner_html_no_effect,
39
+ generate_text_split_inner_html_with_effects,
40
+ generate_voice_mapping_inner_html,
41
+ )
42
 
43
 
44
+ class TTSPhrasesGenerationOutput(BaseModel):
45
+ audio_fps: list[str]
46
+ char2time: TTSTimestampsAlignment
47
 
48
+
49
+ class AudiobookBuilder:
50
+ def __init__(self, rm_artifacts: bool = False):
51
  self.voice_selector = VoiceSelector()
52
+ self.params_tts_processor = TTSParamProcessor()
53
+ self.rm_artifacts = rm_artifacts
54
+ self.min_sound_effect_duration_sec = 1
55
+ self.sound_effects_prompt_influence = 0.75 # seems to work nicely
56
+ self.html_generator = HTMLGenerator()
57
+ self.name = type(self).__name__
58
 
59
+ @staticmethod
60
+ async def _prepare_text_for_tts(text: str) -> str:
61
+ chain = modify_text_chain(llm_model=GPTModels.GPT_4o)
62
+ with get_openai_callback() as cb:
63
+ result = await chain.ainvoke(
64
+ {"text": text}, config={"callbacks": [LCMessageLoggerAsync()]}
65
+ )
66
+ logger.info(
67
+ f'End of modifying text with caps and symbols(?, !, ...). Openai callback stats: {cb}'
68
+ )
69
+ return result.text_modified
70
+
71
+ @staticmethod
72
+ async def _split_text(text: str) -> SplitTextOutput:
73
  chain = create_split_text_chain(llm_model=GPTModels.GPT_4o)
74
  with get_openai_callback() as cb:
75
  chain_out = await chain.ainvoke(
76
  {"text": text}, config={"callbacks": [LCMessageLoggerAsync()]}
77
  )
78
+ logger.info(f'end of splitting text into characters. openai callback stats: {cb}')
79
  return chain_out
80
 
81
+ @staticmethod
82
+ async def _design_sound_effects(text: str) -> SoundEffectsDesignOutput:
83
+ chain = create_sound_effects_design_chain(llm_model=GPTModels.GPT_4o)
84
+ with get_openai_callback() as cb:
85
+ res = await chain.ainvoke(
86
+ {"text": text}, config={"callbacks": [LCMessageLoggerAsync()]}
87
+ )
88
+ logger.info(
89
+ f'designed {len(res.sound_effects_descriptions)} sound effects. '
90
+ f'openai callback stats: {cb}'
91
+ )
92
+ return res
93
+
94
+ async def _map_characters_to_voices(
95
  self, text_split: SplitTextOutput
96
  ) -> SelectVoiceChainOutput:
97
+ chain = self.voice_selector.create_voice_mapping_chain(llm_model=GPTModels.GPT_4o)
 
 
98
  with get_openai_callback() as cb:
99
  chain_out = await chain.ainvoke(
100
  {
 
103
  },
104
  config={"callbacks": [LCMessageLoggerAsync()]},
105
  )
106
+ logger.info(f'end of mapping characters to voices. openai callback stats: {cb}')
107
  return chain_out
108
 
109
+ async def _prepare_params_for_tts(self, text_split: SplitTextOutput) -> list[TTSParams]:
110
+ semaphore = asyncio.Semaphore(OPENAI_MAX_PARALLEL)
111
+
112
+ async def run_task_with_semaphore(func, **params):
113
+ async with semaphore:
114
+ outputs = await func(**params)
115
+ return outputs
116
+
117
+ tasks = []
118
+
119
+ for character_phrase in text_split.phrases:
120
+ tasks.append(
121
+ run_task_with_semaphore(
122
+ func=self.params_tts_processor.run,
123
+ text=character_phrase.text,
124
+ )
125
+ )
126
+
127
+ tts_tasks_results = await asyncio.gather(*tasks)
128
+
129
+ return tts_tasks_results
130
+
131
+ @staticmethod
132
+ def _add_voice_ids_to_tts_params(
133
+ text_split: SplitTextOutput,
134
+ tts_params_list: list[TTSParams],
135
+ character2voice: dict[str, str],
136
+ ) -> list[TTSParams]:
137
+ for character_phrase, params in zip(text_split.phrases, tts_params_list):
138
+ params.voice_id = character2voice[character_phrase.character]
139
+ return tts_params_list
140
+
141
+ @staticmethod
142
+ def _get_left_and_right_contexts_for_each_phrase(
143
+ phrases, context_length=CONTEXT_CHAR_LEN_FOR_TTS
144
+ ):
145
+ """
146
+ Return phrases from left and right sides which don't exceed `context_length`.
147
+ Approx. number of words/tokens based on `context_length` can be calculated by dividing it by 5.
148
+ """
149
+ # TODO: split first context phrase if it exceeds `context_length`, currently it's not added.
150
+ # TODO: optimize algorithm to linear time using sliding window on top of cumulative length sums.
151
+ left_right_contexts = []
152
+ for i in range(len(phrases)):
153
+ left_text, right_text = '', ''
154
+ for j in range(i - 1, -1, -1):
155
+ if len(left_text) + len(phrases[j].text) < context_length:
156
+ left_text = phrases[j].text + left_text
157
+ else:
158
+ break
159
+ for phrase in phrases[i + 1 :]:
160
+ if len(right_text) + len(phrase.text) < context_length:
161
+ right_text += phrase.text
162
+ else:
163
+ break
164
+ left_right_contexts.append((left_text, right_text))
165
+ return left_right_contexts
166
+
167
+ def _add_previous_and_next_context_to_tts_params(
168
+ self,
169
+ text_split: SplitTextOutput,
170
+ tts_params_list: list[TTSParams],
171
+ ) -> list[TTSParams]:
172
+ left_right_contexts = self._get_left_and_right_contexts_for_each_phrase(text_split.phrases)
173
+ for cur_contexts, params in zip(left_right_contexts, tts_params_list):
174
+ left_context, right_context = cur_contexts
175
+ params.previous_text = left_context
176
+ params.next_text = right_context
177
+ return tts_params_list
178
+
179
+ @staticmethod
180
+ async def _generate_tts_audio(
181
+ tts_params_list: list[TTSParams],
182
+ out_dp: str,
183
+ ) -> TTSPhrasesGenerationOutput:
184
+ semaphore = asyncio.Semaphore(ELEVENLABS_MAX_PARALLEL)
185
+
186
+ async def _tts_with_semaphore(params: TTSParams) -> TTSTimestampsResponse:
187
+ async with semaphore:
188
+ return await tts.tts_w_timestamps(params=params)
189
+
190
+ tasks = [_tts_with_semaphore(params=params) for params in tts_params_list]
191
+ tts_responses: list[TTSTimestampsResponse] = await asyncio.gather(*tasks)
192
+
193
+ tts_audio_fps = []
194
+ for ix, (params, res) in enumerate(zip(tts_params_list, tts_responses), start=1):
195
+ out_fp_no_ext = os.path.join(out_dp, f'tts_output_{ix}')
196
+ out_fp = res.write_audio_to_file(
197
+ filepath_no_ext=out_fp_no_ext, audio_format=params.output_format
198
+ )
199
+ tts_audio_fps.append(out_fp)
200
+
201
+ # combine alignments
202
+ alignments = [response.alignment for response in tts_responses]
203
+ char2time = TTSTimestampsAlignment.combine_alignments(alignments=alignments)
204
+ # filter alignments
205
+ char2time = char2time.filter_chars_without_duration()
206
+
207
+ return TTSPhrasesGenerationOutput(audio_fps=tts_audio_fps, char2time=char2time)
208
+
209
+ def _update_sound_effects_descriptions_with_durations(
210
+ self,
211
+ sound_effects_descriptions: list[SoundEffectDescription],
212
+ char2time: TTSTimestampsAlignment,
213
+ ) -> list[SoundEffectDescription]:
214
+ for sed in sound_effects_descriptions:
215
+ ix_start, ix_end = sed.ix_start_orig_text, sed.ix_end_orig_text
216
+ time_start = char2time.get_start_time_by_char_ix(ix_start, safe=True)
217
+ time_end = char2time.get_end_time_by_char_ix(ix_end, safe=True)
218
+ duration = time_end - time_start
219
+ # apply min effect duration
220
+ duration = max(self.min_sound_effect_duration_sec, duration)
221
+ # update inplace
222
+ sed.start_sec = time_start
223
+ sed.duration_sec = duration
224
+ return sound_effects_descriptions
225
+
226
+ # def _filter_short_sound_effects(
227
+ # self,
228
+ # sound_effects_descriptions: list[SoundEffectDescription],
229
+ # ) -> list[SoundEffectDescription]:
230
+ # filtered = [
231
+ # sed
232
+ # for sed in sound_effects_descriptions
233
+ # if sed.duration_sec > self.min_sound_effect_duration_sec
234
+ # ]
235
+
236
+ # len_orig = len(sound_effects_descriptions)
237
+ # len_new = len(filtered)
238
+ # logger.info(
239
+ # f'{len_new} out of {len_orig} original sound effects are kept '
240
+ # f'after filtering by min duration: {self.min_sound_effect_duration_sec}'
241
+ # )
242
+
243
+ # return filtered
244
+
245
+ def _sound_effects_description_2_generation_params(
246
+ self,
247
+ sound_effects_descriptions: list[SoundEffectDescription],
248
+ ) -> list[SoundEffectsParams]:
249
+ params = [
250
+ SoundEffectsParams(
251
+ text=sed.prompt,
252
+ duration_seconds=sed.duration_sec,
253
+ prompt_influence=self.sound_effects_prompt_influence,
254
+ )
255
+ for sed in sound_effects_descriptions
256
+ ]
257
+ return params
258
+
259
+ @staticmethod
260
+ async def _generate_sound_effects(
261
+ sound_effects_params: list[SoundEffectsParams],
262
+ out_dp: str,
263
+ ) -> list[str]:
264
+ semaphore = asyncio.Semaphore(ELEVENLABS_MAX_PARALLEL)
265
+
266
+ async def _se_gen_with_semaphore(params: SoundEffectsParams) -> list[bytes]:
267
+ async with semaphore:
268
+ return await tts.sound_generation_consumed(params=params)
269
+
270
+ tasks = [_se_gen_with_semaphore(params=params) for params in sound_effects_params]
271
+ results = await asyncio.gather(*tasks)
272
+
273
+ se_fps = []
274
+ for ix, task_res in enumerate(results, start=1):
275
+ out_fp = os.path.join(out_dp, f'sound_effect_{ix}.wav')
276
+ utils.write_chunked_bytes(data=task_res, fp=out_fp)
277
+ se_fps.append(out_fp)
278
+
279
+ return se_fps
280
+
281
+ @staticmethod
282
+ def _save_text_split_debug_data(
283
+ text_split: SplitTextOutput,
284
+ out_dp: str,
285
+ ):
286
+ out_fp = os.path.join(out_dp, 'text_split.json')
287
+ # NOTE: use `to_dict()` for correct conversion
288
+ data = text_split.model_dump()
289
+ utils.write_json(data, fp=out_fp)
290
+
291
+ @staticmethod
292
+ def _save_tts_debug_data(
293
+ tts_params_list: list[TTSParams],
294
+ tts_out: TTSPhrasesGenerationOutput,
295
+ out_dp: str,
296
+ ):
297
+ out_fp = os.path.join(out_dp, 'tts.json')
298
+ # NOTE: use `to_dict()` for correct conversion
299
+ data = [param.to_dict() for param in tts_params_list]
300
+ utils.write_json(data, fp=out_fp)
301
+
302
+ out_dp = os.path.join(out_dp, 'tts_char2time.csv')
303
+ df_char2time = tts_out.char2time.to_dataframe()
304
+ df_char2time.to_csv(out_dp, index=True)
305
+
306
+ @staticmethod
307
+ def _save_sound_effects_debug_data(
308
+ sound_effect_design_output: SoundEffectsDesignOutput,
309
+ sound_effect_descriptions: list[SoundEffectDescription],
310
+ out_dp: str,
311
+ ):
312
+ out_fp = os.path.join(out_dp, 'sound_effects_raw_llm_output.txt')
313
+ utils.write_txt(sound_effect_design_output.text_annotated, fp=out_fp)
314
+
315
+ out_fp = os.path.join(out_dp, 'sound_effects_descriptions.json')
316
+ data = [sed.model_dump() for sed in sound_effect_descriptions]
317
+ utils.write_json(data, fp=out_fp)
318
+
319
+ @staticmethod
320
+ def _postprocess_tts_audio(audio_fps: list[str], out_dp: str, target_dBFS: float) -> list[str]:
321
+ fps = []
322
+ for in_fp in audio_fps:
323
+ audio_segment = AudioSegment.from_file(in_fp)
324
+ normalized_audio = utils.normalize_audio(audio_segment, target_dBFS)
325
+
326
+ out_fp = os.path.join(out_dp, f"{Path(in_fp).stem}.normalized.wav")
327
+ normalized_audio.export(out_fp, format="wav")
328
+ fps.append(out_fp)
329
+
330
+ return fps
331
+
332
+ @staticmethod
333
+ def _postprocess_sound_effects(
334
+ audio_fps: list[str], out_dp: str, target_dBFS: float, fade_ms: int
335
+ ) -> list[str]:
336
+ fps = []
337
+ for in_fp in audio_fps:
338
+ audio_segment = AudioSegment.from_file(in_fp)
339
+
340
+ processed = utils.normalize_audio(audio_segment, target_dBFS)
341
+
342
+ processed = processed.fade_in(duration=fade_ms)
343
+ processed = processed.fade_out(duration=fade_ms)
344
+
345
+ out_fp = os.path.join(out_dp, f"{Path(in_fp).stem}.postprocessed.wav")
346
+ processed.export(out_fp, format="wav")
347
+ fps.append(out_fp)
348
+
349
+ return fps
350
+
351
+ @staticmethod
352
+ def _concatenate_audiofiles(audio_fps: list[str], out_wav_fp: str):
353
+ concat = AudioSegment.from_file(audio_fps[0])
354
+ for filename in audio_fps[1:]:
355
+ next_audio = AudioSegment.from_file(filename)
356
+ concat += next_audio
357
+ logger.info(f'saving concatenated audiobook to: "{out_wav_fp}"')
358
+ concat.export(out_wav_fp, format="wav")
359
+
360
+ def _get_text_split_html(
361
+ self,
362
+ text_split: SplitTextOutput,
363
+ sound_effects_descriptions: list[SoundEffectDescription] | None,
364
+ ):
365
+ # modify copies of original phrases, keep original intact
366
+ character_phrases = [p.model_copy(deep=True) for p in text_split.phrases]
367
+ for phrase in character_phrases:
368
+ phrase.character = prettify_unknown_character_label(phrase.character)
369
+
370
+ if not sound_effects_descriptions:
371
+ inner = generate_text_split_inner_html_no_effect(character_phrases=character_phrases)
372
+ else:
373
+ inner = generate_text_split_inner_html_with_effects(
374
+ character_phrases=character_phrases,
375
+ sound_effects_descriptions=sound_effects_descriptions,
376
+ )
377
+
378
+ final = self.html_generator.generate_text_split(inner)
379
+ return final
380
+
381
+ def _get_voice_mapping_html(
382
+ self, use_user_voice: bool, select_voice_chain_out: SelectVoiceChainOutput
383
+ ):
384
+ if use_user_voice:
385
+ return ''
386
+ inner = generate_voice_mapping_inner_html(select_voice_chain_out)
387
+ final = self.html_generator.generate_voice_assignments(inner)
388
+ return final
389
+
390
+ STAGE_1 = 'Text Analysis'
391
+ STAGE_2 = 'Voices Selection'
392
+ STAGE_3 = 'Audio Generation'
393
+
394
+ def _get_yield_data_stage_0(self):
395
+ status = self.html_generator.generate_status("Starting", [("Analyzing Text...", False)])
396
+ return None, "", status
397
+
398
+ def _get_yield_data_stage_1(self, text_split_html: str):
399
+ status_html = create_status_html(
400
+ "Text Analysis Complete",
401
+ [(self.STAGE_1, True), ("Selecting Voices...", False)],
402
+ )
403
+ html = status_html + text_split_html
404
+ return None, "", html
405
+
406
+ def _get_yield_data_stage_2(self, text_split_html: str, voice_mapping_html: str):
407
+ status_html = create_status_html(
408
+ "Voice Selection Complete",
409
+ [(self.STAGE_1, True), (self.STAGE_2, True), ("Generating Audio...", False)],
410
+ )
411
+ html = status_html + text_split_html + voice_mapping_html + '</div>'
412
+ return None, "", html
413
+
414
+ def _get_yield_data_stage_3(
415
+ self, final_audio_fp: str, text_split_html: str, voice_mapping_html: str
416
+ ):
417
+ status_html = create_status_html(
418
+ "Audiobook is ready ✨",
419
+ [(self.STAGE_1, True), (self.STAGE_2, True), (self.STAGE_3, True)],
420
  )
421
+ third_stage_result_html = (
422
+ status_html
423
+ + text_split_html
424
+ + voice_mapping_html
425
+ + self.html_generator.generate_final_message()
426
+ + '</div>'
427
  )
428
+ return final_audio_fp, "", third_stage_result_html
429
+
430
+ async def run(
431
+ self,
432
+ text: str,
433
+ generate_effects: bool,
434
+ use_user_voice: bool = False,
435
+ voice_id: str | None = None,
436
+ ):
437
+ now_str = utils.get_utc_now_str()
438
+ uuid_trimmed = str(uuid4()).split('-')[0]
439
+ dir_name = f'{now_str}-{uuid_trimmed}'
440
+ out_dp_root = os.path.join('data', 'audiobooks', dir_name)
441
+ os.makedirs(out_dp_root, exist_ok=False)
442
+
443
+ debug_dp = os.path.join(out_dp_root, 'debug')
444
+ os.makedirs(debug_dp)
445
+
446
+ # TODO: currently, we are constantly writing and reading audio segments from files.
447
+ # I think it will be more efficient to keep all audio in memory.
448
+
449
+ # zero stage
450
+ if use_user_voice and not voice_id:
451
+ yield None, "", self.html_generator.generate_message_without_voice_id()
452
+
453
+ else:
454
+ yield self._get_yield_data_stage_0()
455
+
456
+ text_for_tts = await self._prepare_text_for_tts(text=text)
457
+
458
+ # TODO: call sound effects chain in parallel with text split chain
459
+ text_split = await self._split_text(text=text_for_tts)
460
+ self._save_text_split_debug_data(text_split=text_split, out_dp=debug_dp)
461
+ # yield stage 1
462
+ text_split_html = self._get_text_split_html(
463
+ text_split=text_split, sound_effects_descriptions=None
464
+ )
465
+ yield self._get_yield_data_stage_1(text_split_html=text_split_html)
466
+
467
+ if generate_effects:
468
+ se_design_output = await self._design_sound_effects(text=text_for_tts)
469
+ se_descriptions = se_design_output.sound_effects_descriptions
470
+ text_split_html = self._get_text_split_html(
471
+ text_split=text_split, sound_effects_descriptions=se_descriptions
472
+ )
473
+
474
+ # TODO: run voice mapping and tts params selection in parallel
475
+ if not use_user_voice:
476
+ select_voice_chain_out = await self._map_characters_to_voices(text_split=text_split)
477
+ else:
478
+ if voice_id is None:
479
+ raise ValueError(f'voice_id is None')
480
+ select_voice_chain_out = SelectVoiceChainOutput(
481
+ character2props={
482
+ char: CharacterPropertiesNullable(gender=None, age_group=None)
483
+ for char in text_split.characters
484
+ },
485
+ character2voice={char: voice_id for char in text_split.characters},
486
+ )
487
+ tts_params_list = await self._prepare_params_for_tts(text_split=text_split)
488
+
489
+ # yield stage 2
490
+ voice_mapping_html = self._get_voice_mapping_html(
491
+ use_user_voice=use_user_voice, select_voice_chain_out=select_voice_chain_out
492
+ )
493
+ yield self._get_yield_data_stage_2(
494
+ text_split_html=text_split_html, voice_mapping_html=voice_mapping_html
495
+ )
496
+
497
+ tts_params_list = self._add_voice_ids_to_tts_params(
498
+ text_split=text_split,
499
+ tts_params_list=tts_params_list,
500
+ character2voice=select_voice_chain_out.character2voice,
501
+ )
502
+
503
+ tts_params_list = self._add_previous_and_next_context_to_tts_params(
504
+ text_split=text_split,
505
+ tts_params_list=tts_params_list,
506
+ )
507
+
508
+ tts_dp = os.path.join(out_dp_root, 'tts')
509
+ os.makedirs(tts_dp)
510
+ tts_out = await self._generate_tts_audio(tts_params_list=tts_params_list, out_dp=tts_dp)
511
+
512
+ self._save_tts_debug_data(
513
+ tts_params_list=tts_params_list, tts_out=tts_out, out_dp=debug_dp
514
+ )
515
+
516
+ if generate_effects:
517
+ se_descriptions = self._update_sound_effects_descriptions_with_durations(
518
+ sound_effects_descriptions=se_descriptions, char2time=tts_out.char2time
519
+ )
520
+
521
+ # no need in filtering, since we ensure the min duration above
522
+ # se_descriptions = self._filter_short_sound_effects(
523
+ # sound_effects_descriptions=se_descriptions
524
+ # )
525
+
526
+ se_params = self._sound_effects_description_2_generation_params(
527
+ sound_effects_descriptions=se_descriptions
528
+ )
529
+
530
+ if len(se_descriptions) != len(se_params):
531
+ raise ValueError(
532
+ f'expected {len(se_descriptions)} sound effects params, got: {len(se_params)}'
533
+ )
534
+
535
+ effects_dp = os.path.join(out_dp_root, 'sound_effects')
536
+ os.makedirs(effects_dp)
537
+ se_fps = await self._generate_sound_effects(
538
+ sound_effects_params=se_params, out_dp=effects_dp
539
+ )
540
+
541
+ if len(se_descriptions) != len(se_fps):
542
+ raise ValueError(
543
+ f'expected {len(se_descriptions)} generated sound effects, got: {len(se_fps)}'
544
+ )
545
+
546
+ self._save_sound_effects_debug_data(
547
+ sound_effect_design_output=se_design_output,
548
+ sound_effect_descriptions=se_descriptions,
549
+ out_dp=debug_dp,
550
+ )
551
+
552
+ tts_normalized_dp = os.path.join(out_dp_root, 'tts_normalized')
553
+ os.makedirs(tts_normalized_dp)
554
+ tts_norm_fps = self._postprocess_tts_audio(
555
+ audio_fps=tts_out.audio_fps,
556
+ out_dp=tts_normalized_dp,
557
+ target_dBFS=-20,
558
+ )
559
+
560
+ if generate_effects:
561
+ se_normalized_dp = os.path.join(out_dp_root, 'sound_effects_postprocessed')
562
+ os.makedirs(se_normalized_dp)
563
+ se_norm_fps = self._postprocess_sound_effects(
564
+ audio_fps=se_fps,
565
+ out_dp=se_normalized_dp,
566
+ target_dBFS=-27,
567
+ fade_ms=500,
568
+ )
569
+
570
+ tts_concat_fp = os.path.join(out_dp_root, f'audiobook_{now_str}.wav')
571
+ self._concatenate_audiofiles(audio_fps=tts_norm_fps, out_wav_fp=tts_concat_fp)
572
+
573
+ if not generate_effects:
574
+ final_audio_fp = tts_concat_fp
575
+ else:
576
+ tts_concat_with_effects_fp = os.path.join(
577
+ out_dp_root, f'audiobook_with_effects_{now_str}.wav'
578
+ )
579
+ se_starts_sec = [sed.start_sec for sed in se_descriptions]
580
+ utils.overlay_multiple_audio(
581
+ main_audio_fp=tts_concat_fp,
582
+ audios_to_overlay_fps=se_norm_fps,
583
+ starts_sec=se_starts_sec,
584
+ out_fp=tts_concat_with_effects_fp,
585
+ )
586
+ final_audio_fp = tts_concat_with_effects_fp
587
+
588
+ utils.rm_dir_conditional(dp=out_dp_root, to_remove=self.rm_artifacts)
589
+
590
+ # yield stage 3
591
+ yield self._get_yield_data_stage_3(
592
+ final_audio_fp=final_audio_fp,
593
+ text_split_html=text_split_html,
594
+ voice_mapping_html=voice_mapping_html,
595
+ )
596
+
597
+ logger.info(f'end of {self.name}.run()')
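Every batch of OpenAI and ElevenLabs calls in the new `AudiobookBuilder` (`_prepare_params_for_tts`, `_generate_tts_audio`, `_generate_sound_effects`) uses the same semaphore-plus-gather idiom to respect `OPENAI_MAX_PARALLEL` and `ELEVENLABS_MAX_PARALLEL`. A self-contained sketch of that pattern, independent of the project classes:

```python
# Generic form of the concurrency control used throughout the builder:
# at most `max_parallel` coroutines run at once; results keep input order.
import asyncio


async def bounded_gather(coroutine_factories, max_parallel: int) -> list:
    semaphore = asyncio.Semaphore(max_parallel)

    async def _run(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(_run(factory) for factory in coroutine_factories))


async def _demo() -> None:
    async def work(i: int) -> int:
        await asyncio.sleep(0.01)
        return i * i

    results = await bounded_gather([lambda i=i: work(i) for i in range(20)], max_parallel=5)
    print(results)


if __name__ == "__main__":
    asyncio.run(_demo())
```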
src/config.py CHANGED
@@ -1,5 +1,5 @@
1
- import os
2
  import logging
 
3
 
4
  logging.basicConfig(
5
  level=logging.INFO,
@@ -12,8 +12,11 @@ ELEVENLABS_API_KEY = os.environ["ELEVEN_LABS_API_KEY"]
12
 
13
  FILE_SIZE_MAX = 0.5 # in mb
14
 
15
- OPENAI_MAX_PARALLEL = 8 # empirically set
16
- ELEVENLABS_MAX_PARALLEL = 15 # current limitation of available subscription
 
 
 
17
 
18
  # VOICES_CSV_FP = "data/11labs_available_tts_voices.csv"
19
  VOICES_CSV_FP = "data/11labs_available_tts_voices.reviewed.csv"
@@ -29,8 +32,15 @@ All you need to do - is to input the book text or select it from the provided Sa
29
 
30
  AI will do the rest:
31
  - split text into characters
32
- - assign each character a voice
33
  - preprocess text to better convey emotions during Text-to-Speech
34
  - (optionally) add sound effects to create immersive atmosphere
35
  - generate audiobook using Text-to-Speech model
36
  """
 
 
 
 
1
  import logging
2
+ import os
3
 
4
  logging.basicConfig(
5
  level=logging.INFO,
 
12
 
13
  FILE_SIZE_MAX = 0.5 # in mb
14
 
15
+ OPENAI_MAX_PARALLEL = 10 # empirically set
16
+
17
+ # current limitation of available subscription.
18
+ # see: https://elevenlabs.io/docs/api-reference/text-to-speech#generation-and-concurrency-limits
19
+ ELEVENLABS_MAX_PARALLEL = 15
20
 
21
  # VOICES_CSV_FP = "data/11labs_available_tts_voices.csv"
22
  VOICES_CSV_FP = "data/11labs_available_tts_voices.reviewed.csv"
 
32
 
33
  AI will do the rest:
34
  - split text into characters
35
+ - select voice for each character
36
  - preprocess text to better convey emotions during Text-to-Speech
37
  - (optionally) add sound effects to create immersive atmosphere
38
  - generate audiobook using Text-to-Speech model
39
  """
40
+
41
+ DEFAULT_TTS_STABILITY = 0.5
42
+ DEFAULT_TTS_STABILITY_ACCEPTABLE_RANGE = (0.3, 0.8)
43
+ DEFAULT_TTS_SIMILARITY_BOOST = 0.5
44
+ DEFAULT_TTS_STYLE = 0.0
45
+
46
+ CONTEXT_CHAR_LEN_FOR_TTS = 500
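`CONTEXT_CHAR_LEN_FOR_TTS` drives the builder's `_get_left_and_right_contexts_for_each_phrase`: neighbouring phrases are accumulated into the left and right TTS contexts while they stay under 500 characters, which is roughly 100 words at ~5 characters per word. A compact restatement of that rule on plain strings:

```python
# Same accumulation rule as in AudiobookBuilder, written against a plain list of strings.
def left_right_context(texts: list[str], i: int, limit: int = 500) -> tuple[str, str]:
    left, right = "", ""
    for t in reversed(texts[:i]):  # walk backwards from the phrase before i
        if len(left) + len(t) >= limit:
            break
        left = t + left
    for t in texts[i + 1:]:  # walk forwards from the phrase after i
        if len(right) + len(t) >= limit:
            break
        right += t
    return left, right
```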
src/emotions/generation.py DELETED
@@ -1,208 +0,0 @@
1
- import json
2
- import typing as t
3
- from abc import ABC, abstractmethod
4
-
5
- import openai
6
- from pydantic import BaseModel
7
- from requests import HTTPError
8
-
9
- from src.config import OPENAI_API_KEY, logger
10
- from src.utils import auto_retry
11
-
12
- from .prompts import (
13
- SOUND_EFFECT_GENERATION,
14
- SOUND_EFFECT_GENERATION_WITHOUT_DURATION_PREDICTION,
15
- TEXT_MODIFICATION,
16
- TEXT_MODIFICATION_WITH_SSML,
17
- )
18
- from .utils import get_audio_duration
19
-
20
-
21
- class TextPreparationForTTSTaskOutput(BaseModel):
22
- task: str
23
- output: t.Any
24
-
25
-
26
- class AbstractEffectGenerator(ABC):
27
- @abstractmethod
28
- async def generate_text_for_sound_effect(self, text) -> dict:
29
- pass
30
-
31
- @abstractmethod
32
- async def generate_parameters_for_sound_effect(
33
- self, text: str, generated_audio_file: str | None
34
- ) -> TextPreparationForTTSTaskOutput:
35
- pass
36
-
37
- @abstractmethod
38
- async def add_emotion_to_text(self, text: str) -> TextPreparationForTTSTaskOutput:
39
- pass
40
-
41
-
42
- # class EffectGenerator(AbstractEffectGenerator):
43
- # def __init__(self, predict_duration: bool = True, model_type: str = "gpt-4o"):
44
- # self.client = openai.OpenAI(api_key=OPENAI_API_KEY)
45
- # self.sound_effect_prompt = (
46
- # SOUND_EFFECT_GENERATION
47
- # if predict_duration
48
- # else SOUND_EFFECT_GENERATION_WITHOUT_DURATION_PREDICTION
49
- # )
50
- # self.text_modification_prompt = TEXT_MODIFICATION_WITH_SSML
51
- # self.model_type = model_type
52
- # logger.info(
53
- # f"EffectGenerator initialized with model_type: {model_type}, predict_duration: {predict_duration}"
54
- # )
55
-
56
- # @auto_retry
57
- # def generate_text_for_sound_effect(self, text: str) -> dict:
58
- # """Generate sound effect description and parameters based on input text."""
59
- # try:
60
- # completion = self.client.chat.completions.create(
61
- # model=self.model_type,
62
- # messages=[
63
- # {"role": "system", "content": self.sound_effect_prompt},
64
- # {"role": "user", "content": text},
65
- # ],
66
- # response_format={"type": "json_object"},
67
- # )
68
- # # Extracting the output
69
- # chatgpt_output = completion.choices[0].message.content
70
-
71
- # # Parse and return JSON response
72
- # output_dict = json.loads(chatgpt_output)
73
- # logger.info(
74
- # "Successfully generated sound effect description: %s", output_dict
75
- # )
76
- # return output_dict
77
-
78
- # except json.JSONDecodeError as e:
79
- # logger.error("Failed to parse the output text as JSON: %s", e)
80
- # raise RuntimeError(
81
- # f"Error: Failed to parse the output text as JSON.\nOutput: {chatgpt_output}"
82
- # )
83
-
84
- # except HTTPError as e:
85
- # logger.error("HTTP error occurred: %s", e)
86
- # raise RuntimeError(f"HTTP Error: {e}")
87
-
88
- # except Exception as e:
89
- # logger.error("Unexpected error occurred: %s", e)
90
- # raise RuntimeError(f"Unexpected Error: {e}")
91
-
92
- # @auto_retry
93
- # def generate_parameters_for_sound_effect(
94
- # self, text: str, generated_audio_file: str = None
95
- # ) -> dict:
96
- # llm_output = self.generate_text_for_sound_effect(text)
97
- # if generated_audio_file is not None:
98
- # llm_output["duration_seconds"] = get_audio_duration(generated_audio_file)
99
- # logger.info(
100
- # "Added duration_seconds to output based on generated audio file: %s",
101
- # generated_audio_file,
102
- # )
103
- # return llm_output
104
-
105
- # @auto_retry
106
- # def add_emotion_to_text(self, text: str) -> dict:
107
- # completion = self.client.chat.completions.create(
108
- # model=self.model_type,
109
- # messages=[
110
- # {"role": "system", "content": self.text_modification_prompt},
111
- # {"role": "user", "content": text},
112
- # ],
113
- # response_format={"type": "json_object"},
114
- # )
115
- # chatgpt_output = completion.choices[0].message.content
116
- # try:
117
- # output_dict = json.loads(chatgpt_output)
118
- # logger.info(
119
- # "Successfully modified text with emotional cues: %s", output_dict
120
- # )
121
- # return output_dict
122
- # except json.JSONDecodeError as e:
123
- # logger.error("Error in parsing the modified text: %s", e)
124
- # raise f"error, output_text: {chatgpt_output}"
125
-
126
-
127
- class EffectGeneratorAsync(AbstractEffectGenerator):
128
- def __init__(self, predict_duration: bool, model_type: str = "gpt-4o"):
129
- self.client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
130
- self.sound_effect_prompt = (
131
- SOUND_EFFECT_GENERATION
132
- if predict_duration
133
- else SOUND_EFFECT_GENERATION_WITHOUT_DURATION_PREDICTION
134
- )
135
- self.text_modification_prompt = TEXT_MODIFICATION_WITH_SSML
136
- self.model_type = model_type
137
-
138
- @auto_retry
139
- async def generate_text_for_sound_effect(self, text: str) -> dict:
140
- """Asynchronous version to generate sound effect description."""
141
- try:
142
- completion = await self.client.chat.completions.create(
143
- model=self.model_type,
144
- messages=[
145
- {"role": "system", "content": self.sound_effect_prompt},
146
- {"role": "user", "content": text},
147
- ],
148
- response_format={"type": "json_object"},
149
- )
150
- # Extracting the output
151
- chatgpt_output = completion.choices[0].message.content
152
-
153
- # Parse and return JSON response
154
- output_dict = json.loads(chatgpt_output)
155
- logger.info(
156
- "Successfully generated sound effect description: %s", output_dict
157
- )
158
- return output_dict
159
-
160
- except json.JSONDecodeError as e:
161
- logger.error("Failed to parse the output text as JSON: %s", e)
162
- raise RuntimeError(
163
- f"Error: Failed to parse the output text as JSON.\nOutput: {chatgpt_output}"
164
- )
165
-
166
- except HTTPError as e:
167
- logger.error("HTTP error occurred: %s", e)
168
- raise RuntimeError(f"HTTP Error: {e}")
169
-
170
- except Exception as e:
171
- logger.error("Unexpected error occurred: %s", e)
172
- raise RuntimeError(f"Unexpected Error: {e}")
173
-
174
- @auto_retry
175
- async def generate_parameters_for_sound_effect(
176
- self, text: str, generated_audio_file: str | None = None
177
- ) -> TextPreparationForTTSTaskOutput:
178
- llm_output = await self.generate_text_for_sound_effect(text)
179
- if generated_audio_file is not None:
180
- llm_output["duration_seconds"] = get_audio_duration(generated_audio_file)
181
- logger.info(
182
- "Added duration_seconds to output based on generated audio file: %s",
183
- generated_audio_file,
184
- )
185
- return TextPreparationForTTSTaskOutput(task="add_effects", output=llm_output)
186
-
187
- @auto_retry
188
- async def add_emotion_to_text(self, text: str) -> TextPreparationForTTSTaskOutput:
189
- completion = await self.client.chat.completions.create(
190
- model=self.model_type,
191
- messages=[
192
- {"role": "system", "content": self.text_modification_prompt},
193
- {"role": "user", "content": text},
194
- ],
195
- response_format={"type": "json_object"},
196
- )
197
- chatgpt_output = completion.choices[0].message.content
198
- try:
199
- output_dict = json.loads(chatgpt_output)
200
- logger.info(
201
- "Successfully modified text with emotional cues: %s", output_dict
202
- )
203
- return TextPreparationForTTSTaskOutput(
204
- task="add_emotion", output=output_dict
205
- )
206
- except json.JSONDecodeError as e:
207
- logger.error("Error in parsing the modified text: %s", e)
208
- raise f"error, output_text: {chatgpt_output}"
 
 
 
 
src/emotions/prompts.py DELETED
@@ -1,160 +0,0 @@
1
- PREFIX = """\
2
- You should help me to make an audiobook with realistic emotion sound using TTS.
3
- You are tasked with generating a description of sound effects
4
- that matches the atmosphere, actions, and tone of a given sentence or text from a book.
5
- The description should be tailored to create a sound effect using ElevenLabs'sound generation API.
6
- The generated sound description must evoke the scene
7
- or emotions from the text (e.g., footsteps, wind, tense silence, etc.),
8
- and it should be succinct and fit the mood of the text."""
9
-
10
- SOUND_EFFECT_GENERATION = f"""
11
- {PREFIX}
12
-
13
- Additionally, you should include the following parameters in your response:
14
-
15
- Text: A generated description of the sound that matches the text provided.
16
- Keep the description simple and effective to capture the soundscape.
17
- This text will be converted into a sound effect.
18
- Duration_seconds: The appropriate duration of the sound effect,
19
- which should be calculated based on the length and nature of the scene.
20
- Cap this duration at 22 seconds. But be carefully, for very long text in input make a long sound effect,
21
- for small make a small one. And the duration should be similar to duration of input text
22
- Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation closely
23
- follow the sound description. For general sound effects (e.g., footsteps, background ambiance),
24
- use a value around 0.3. For more specific or detailed sound scenes
25
- (e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.
26
-
27
- Your output should be in the following JSON format:
28
-
29
- {{
30
- "text": "A soft breeze rustling through leaves, distant birds chirping.",
31
- "duration_seconds": 4.0,
32
- "prompt_influence": 0.4
33
- }}
34
-
35
- NOTES:
36
- - NEVER add any speech or voices in your instructions!
37
- - NEVER add any music in your instructions!
38
- - NEVER add city sounds, car honks in your instructions!
39
- - make your text descriptions VERY SPECIFIC, AVOID vague instructions.
40
- If it's necessary, you can use couple sentences to formulate the instruction.
41
- But remember to use keep instructions simple.
42
- - aim to create specific sounds, like crackling fireplace, footsteps, wind, etc...
43
- """
44
-
45
- SOUND_EFFECT_GENERATION_WITHOUT_DURATION_PREDICTION = f"""
46
- {PREFIX}
47
-
48
- Additionally, you should include the following parameters in your response:
49
-
50
- Text: A generated description of the sound that matches the text provided.
51
- Keep the description simple and effective to capture the soundscape.
52
- This text will be converted into a sound effect.
53
- Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation closely
54
- follow the sound description. For general sound effects (e.g., footsteps, background ambiance),
55
- use a value around 0.3. For more specific or detailed sound scenes
56
- (e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.
57
-
58
- Your output should be in the following JSON format:
59
-
60
- {{
61
- "text": "A soft breeze rustling through leaves, distant birds chirping.",
62
- "prompt_influence": 0.4
63
- }}"""
64
-
65
- TEXT_MODIFICATION = """
66
- You should help me to make an audiobook with realistic emotion-based voice using TTS.
67
- You are tasked with adjusting the emotional tone of a given text
68
- by modifying the text with special characters such as "!", "...", "-", "~",
69
- and uppercase words to add emphasis or convey emotion. For adding more emotion u can
70
- duplicate special characters for example "!!!".
71
- Do not remove or add any different words.
72
- Only alter the presentation of the existing words.
73
-
74
- Also you can add pause in the output text if it needed
75
- The most consistent way is programmatically using the syntax <break time="1.5s" />. or any time in second if it fit to the text
76
- This will create an exact and natural pause in the speech.
77
- It is not just added silence between words,
78
- but the AI has an actual understanding of this syntax and will add a natural pause.
79
-
80
- After modifying the text, adjust the "stability", "similarity_boost" and "style" parameters
81
- according to the level of emotional intensity in the modified text.
82
- Higher emotional intensity should lower the "stability" and raise the "similarity_boost".
83
- Your output should be in the following JSON format:
84
- {
85
- "modified_text": "Modified text with emotional adjustments.",
86
- "params": {
87
- "stability": 0.7,
88
- "similarity_boost": 0.5,
89
- "style": 0.3
90
- }
91
- }
92
-
93
- The "stability" parameter should range from 0 to 1,
94
- with lower values indicating a more expressive, less stable voice.
95
- The "similarity_boost" parameter should also range from 0 to 1,
96
- with higher values indicating more emphasis on the voice similarity.
97
- The "style" parameter should also range from 0 to 1,
98
- where lower values indicate a neutral tone and higher values reflect more stylized or emotional delivery.
99
- Adjust both according to the emotional intensity of the text.
100
-
101
- Example of text that could be passed:
102
-
103
- Text: "I can't believe this is happening."
104
- """
105
-
106
- TEXT_MODIFICATION_WITH_SSML = """
107
- You should help me to make an audiobook with overabundant emotion-based voice using TTS.
108
- You are tasked with transforming the text provided into a sophisticated SSML script
109
- that is optimized for emotionally, dramatically and breathtaking rich audiobook narration.
110
- Analyze the text for underlying emotions, detect nuances in intonation, and discern the intended impact.
111
- Apply suitable SSML enhancements to ensure that the final TTS output delivers
112
- a powerful, engaging, dramatic and breathtaking listening experience appropriate for an audiobook context
113
- (more effects/emotions are better than less)."
114
-
115
- Please, use only provided SSML tags and don't generate any other tags.
116
- Key SSML Tags to Utilize:
117
- <speak>: This is the root element. All SSML content to be synthesized must be enclosed within this tag.
118
- <prosody>: Manipulates pitch, rate, and volume to convey various emotions and emphases. Use this tag to adjust the voice to match the mood and tone of different parts of the narrative.
119
- <break>: Inserts pauses of specified durations. Use this to create natural breaks in speech, aiding in dramatic effect and better comprehension for listeners.
120
- <emphasis>: Adds stress to words or phrases to highlight key points or emotions, similar to vocal emphasis in natural speech.
121
- <p> and <s>: Structural tags that denote paragraphs and sentences, respectively. They help to manage the flow and pacing of the narrative appropriately.
122
-
123
- Input Text Example: "He stood there, gazing into the endless horizon. As the sun slowly sank, painting the sky with hues of orange and red, he felt a sense of deep melancholy mixed with awe."
124
-
125
- Modified text should be in the XML format. Expected SSML-enriched Output:
126
-
127
- <speak>
128
- <p>
129
- <s>
130
- He stood there, <prosody rate="slow" volume="soft">gazing into the endless horizon.</prosody>
131
- </s>
132
- <s>
133
- As the sun slowly <prosody rate="medium" pitch="-2st">sank,</prosody>
134
- <prosody volume="medium" pitch="+1st">painting the sky with hues of orange and red,</prosody>
135
- he felt a sense of deep <prosody volume="soft" pitch="-1st">melancholy</prosody> mixed with <emphasis level="moderate">awe.</emphasis>
136
- </s>
137
- </p>
138
- </speak>
139
-
140
- After modifying the text, adjust the "stability", "similarity_boost" and "style" parameters
141
- according to the level of emotional intensity in the modified text.
142
- Higher emotional intensity should lower the "stability" and raise the "similarity_boost".
143
- Your output should be in the following JSON format:
144
- {
145
- "modified_text": "Modified text in xml format with SSML tags.",
146
- "params": {
147
- "stability": 0.7,
148
- "similarity_boost": 0.5,
149
- "style": 0.3
150
- }
151
- }
152
-
153
- The "stability" parameter should range from 0 to 1,
154
- with lower values indicating a more expressive, less stable voice.
155
- The "similarity_boost" parameter should also range from 0 to 1,
156
- with higher values indicating more emphasis on the voice similarity.
157
- The "style" parameter should also range from 0 to 1,
158
- where lower values indicate a neutral tone and higher values reflect more stylized or emotional delivery.
159
- Adjust both according to the emotional intensity of the text.
160
- """
src/emotions/utils.py DELETED
@@ -1,75 +0,0 @@
1
- from pydub import AudioSegment
2
- from pathlib import Path
3
- from elevenlabs import ElevenLabs, AsyncElevenLabs
4
- from elevenlabs import play, save
5
-
6
- from src.config import logger
7
-
8
-
9
- def get_audio_duration(filepath: str) -> float:
10
- """
11
- Returns the duration of the audio file in seconds.
12
-
13
- :param filepath: Path to the audio file.
14
- :return: Duration of the audio file in seconds.
15
- """
16
- audio = AudioSegment.from_file(filepath)
17
- duration_in_seconds = len(audio) / 1000 # Convert milliseconds to seconds
18
- return round(duration_in_seconds, 1)
19
-
20
-
21
- def add_overlay_for_audio(
22
- main_audio_filename: str,
23
- sound_effect_filename: str,
24
- output_filename: str = None,
25
- cycling_effect: bool = True,
26
- decrease_effect_volume: int = 0,
27
- ) -> str:
28
- try:
29
- main_audio = AudioSegment.from_file(main_audio_filename)
30
- effect_audio = AudioSegment.from_file(sound_effect_filename)
31
- except Exception as e:
32
- raise RuntimeError(f"Error loading audio files: {e}")
33
-
34
- if cycling_effect:
35
- while len(effect_audio) < len(main_audio):
36
- effect_audio += effect_audio
37
-
38
- effect_audio = effect_audio[: len(main_audio)]
39
-
40
- if decrease_effect_volume > 0:
41
- effect_audio = effect_audio - decrease_effect_volume
42
- combined_audio = main_audio.overlay(effect_audio)
43
-
44
- if output_filename is None:
45
- output_filename = (
46
- f"{Path(main_audio_filename).stem}_{Path(sound_effect_filename).stem}.wav"
47
- )
48
- combined_audio.export(output_filename, format="wav")
49
- return output_filename
50
-
51
-
52
- def sound_generation(sound_generation_data: dict, output_file: str):
53
- client = ElevenLabs(
54
- api_key="YOUR_API_KEY",
55
- )
56
- audio = client.text_to_sound_effects.convert(
57
- text=sound_generation_data["text"],
58
- duration_seconds=sound_generation_data["duration_seconds"],
59
- prompt_influence=sound_generation_data["prompt_influence"],
60
- )
61
- save(audio, output_file)
62
- logger.error("Successfully generated sound effect to file: %s", output_file)
63
-
64
-
65
- async def sound_generation_async(sound_generation_data: dict, output_file: str):
66
- client = AsyncElevenLabs(
67
- api_key="YOUR_API_KEY",
68
- )
69
- audio = await client.text_to_sound_effects.convert(
70
- text=sound_generation_data["text"],
71
- duration_seconds=sound_generation_data["duration_seconds"],
72
- prompt_influence=sound_generation_data["prompt_influence"],
73
- )
74
- save(audio, output_file)
75
- logger.error("Successfully generated sound effect to file: %s", output_file)
src/generate_emotional_voice.py CHANGED
@@ -1,8 +1,9 @@
1
- from openai import OpenAI
2
  import json
 
3
  import requests
 
4
 
5
- client = OpenAI(api_key = '')
6
  PROMT = """
7
  You should help me to make an audiobook with realistic emotion-based voice using TTS.
8
  You are tasked with adjusting the emotional tone of a given text
@@ -45,12 +46,12 @@ He sat down on the couch, his hand tracing the empty space beside him where she
45
  He knew he would never see her smile again, never hear her voice and that was unbearable. Yet, he couldn’t reconcile himself with the fact that she was truly gone. 'How do I go on?' β€” he wondered, but there was no answer.
46
  """
47
 
 
48
  def generate_modified_text(text: str) -> dict:
49
  completion = client.chat.completions.create(
50
  model="gpt-4o",
51
- messages=[{"role": "system", "content": PROMT},
52
- {"role": "user", "content": text}],
53
- response_format={"type": "json_object"}
54
  )
55
  chatgpt_output = completion.choices[0].message.content
56
  try:
@@ -64,17 +65,9 @@ def generate_audio(text: str, params: dict, output_file: str):
64
  CHUNK_SIZE = 1024
65
  url = "https://api.elevenlabs.io/v1/text-to-speech/pMsXgVXv3BLzUgSXRplE"
66
 
67
- headers = {
68
- "Accept": "audio/mpeg",
69
- "Content-Type": "application/json",
70
- "xi-api-key": ""
71
- }
72
 
73
- data = {
74
- "text": text,
75
- "model_id": "eleven_monolingual_v1",
76
- "voice_settings": params
77
- }
78
 
79
  response = requests.post(url, json=data, headers=headers)
80
  with open(f'{output_file}.mp3', 'wb') as f:
@@ -82,13 +75,14 @@ def generate_audio(text: str, params: dict, output_file: str):
82
  if chunk:
83
  f.write(chunk)
84
 
 
85
  if __name__ == "__main__":
86
- default_param = {
87
- "stability": 0.5,
88
- "similarity_boost": 0.5,
89
- "style": 0.5
90
- }
91
  generate_audio(text_to_modified, default_param, "text_without_prompt")
92
  modified_text_with_params = generate_modified_text(text_to_modified)
93
  print(modified_text_with_params)
94
- generate_audio(modified_text_with_params['modified_text'], modified_text_with_params['params'], "text_with_prompt")
 
 
 
 
 
 
1
  import json
2
+
3
  import requests
4
+ from openai import OpenAI
5
 
6
+ client = OpenAI(api_key='')
7
  PROMT = """
8
  You should help me to make an audiobook with realistic emotion-based voice using TTS.
9
  You are tasked with adjusting the emotional tone of a given text
 
46
  He knew he would never see her smile again, never hear her voice and that was unbearable. Yet, he couldn’t reconcile himself with the fact that she was truly gone. 'How do I go on?' β€” he wondered, but there was no answer.
47
  """
48
 
49
+
50
  def generate_modified_text(text: str) -> dict:
51
  completion = client.chat.completions.create(
52
  model="gpt-4o",
53
+ messages=[{"role": "system", "content": PROMT}, {"role": "user", "content": text}],
54
+ response_format={"type": "json_object"},
 
55
  )
56
  chatgpt_output = completion.choices[0].message.content
57
  try:
 
65
  CHUNK_SIZE = 1024
66
  url = "https://api.elevenlabs.io/v1/text-to-speech/pMsXgVXv3BLzUgSXRplE"
67
 
68
+ headers = {"Accept": "audio/mpeg", "Content-Type": "application/json", "xi-api-key": ""}
 
 
 
 
69
 
70
+ data = {"text": text, "model_id": "eleven_monolingual_v1", "voice_settings": params}
 
 
 
 
71
 
72
  response = requests.post(url, json=data, headers=headers)
73
  with open(f'{output_file}.mp3', 'wb') as f:
 
75
  if chunk:
76
  f.write(chunk)
77
 
78
+
79
  if __name__ == "__main__":
80
+ default_param = {"stability": 0.5, "similarity_boost": 0.5, "style": 0.5}
 
 
 
 
81
  generate_audio(text_to_modified, default_param, "text_without_prompt")
82
  modified_text_with_params = generate_modified_text(text_to_modified)
83
  print(modified_text_with_params)
84
+ generate_audio(
85
+ modified_text_with_params['modified_text'],
86
+ modified_text_with_params['params'],
87
+ "text_with_prompt",
88
+ )
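
A minimal sketch, not part of the commit: the blank api_key and xi-api-key values above could be filled from src.config, which other modules in this same diff already import (OPENAI_API_KEY, ELEVENLABS_API_KEY), instead of being hard-coded.

from openai import OpenAI

from src.config import ELEVENLABS_API_KEY, OPENAI_API_KEY

# assumes the .env-based configuration used elsewhere in the project
client = OpenAI(api_key=OPENAI_API_KEY)
headers = {
    "Accept": "audio/mpeg",
    "Content-Type": "application/json",
    "xi-api-key": ELEVENLABS_API_KEY,
}
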
src/lc_callbacks.py CHANGED
@@ -1,9 +1,9 @@
1
  import typing as t
2
 
3
  from langchain_core.callbacks import AsyncCallbackHandler
 
4
  from langchain_core.outputs import ChatGeneration
5
  from langchain_core.outputs.llm_result import LLMResult
6
- from langchain_core.messages import BaseMessage
7
 
8
  from src.config import logger
9
 
@@ -45,13 +45,9 @@ class LCMessageLoggerAsync(AsyncCallbackHandler):
45
  """Run when LLM ends running."""
46
  generations = response.generations
47
  if len(generations) != 1:
48
- raise ValueError(
49
- f'expected "generations" to have len 1, got: {len(generations)}'
50
- )
51
  if len(generations[0]) != 1:
52
- raise ValueError(
53
- f'expected "generations[0]" to have len 1, got: {len(generations[0])}'
54
- )
55
 
56
  if self._log_raw_llm_response is True:
57
  gen: ChatGeneration = generations[0][0]
 
1
  import typing as t
2
 
3
  from langchain_core.callbacks import AsyncCallbackHandler
4
+ from langchain_core.messages import BaseMessage
5
  from langchain_core.outputs import ChatGeneration
6
  from langchain_core.outputs.llm_result import LLMResult
 
7
 
8
  from src.config import logger
9
 
 
45
  """Run when LLM ends running."""
46
  generations = response.generations
47
  if len(generations) != 1:
48
+ raise ValueError(f'expected "generations" to have len 1, got: {len(generations)}')
 
 
49
  if len(generations[0]) != 1:
50
+ raise ValueError(f'expected "generations[0]" to have len 1, got: {len(generations[0])}')
 
 
51
 
52
  if self._log_raw_llm_response is True:
53
  gen: ChatGeneration = generations[0][0]
src/preprocess_tts_emotions_chain.py ADDED
@@ -0,0 +1,73 @@
1
+ import json
2
+
3
+ import openai
4
+ from elevenlabs import VoiceSettings
5
+
6
+ from src.config import (
7
+ DEFAULT_TTS_SIMILARITY_BOOST,
8
+ DEFAULT_TTS_STABILITY,
9
+ DEFAULT_TTS_STABILITY_ACCEPTABLE_RANGE,
10
+ DEFAULT_TTS_STYLE,
11
+ OPENAI_API_KEY,
12
+ logger,
13
+ )
14
+ from src.prompts import EMOTION_STABILITY_MODIFICATION
15
+ from src.schemas import TTSParams
16
+ from src.utils import GPTModels, auto_retry
17
+
18
+
19
+ class TTSParamProcessor:
20
+
21
+ # TODO: refactor to langchain function (?)
22
+
23
+ def __init__(self):
24
+ self.client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
25
+
26
+ @staticmethod
27
+ def _wrap_results(data: dict, default_text: str) -> TTSParams:
28
+ stability = data.get('stability', DEFAULT_TTS_STABILITY)
29
+ stability = max(stability, DEFAULT_TTS_STABILITY_ACCEPTABLE_RANGE[0])
30
+ stability = min(stability, DEFAULT_TTS_STABILITY_ACCEPTABLE_RANGE[1])
31
+
32
+ similarity_boost = DEFAULT_TTS_SIMILARITY_BOOST
33
+ style = DEFAULT_TTS_STYLE
34
+
35
+ params = TTSParams(
36
+ # NOTE: voice will be set later in the builder pipeline
37
+ voice_id='',
38
+ text=default_text,
39
+ # reference: https://elevenlabs.io/docs/speech-synthesis/voice-settings
40
+ voice_settings=VoiceSettings(
41
+ stability=stability,
42
+ similarity_boost=similarity_boost,
43
+ style=style,
44
+ use_speaker_boost=False,
45
+ ),
46
+ )
47
+ return params
48
+
49
+ @auto_retry
50
+ async def run(self, text: str) -> TTSParams:
51
+ text_prepared = text.strip()
52
+
53
+ completion = await self.client.chat.completions.create(
54
+ model=GPTModels.GPT_4o,
55
+ messages=[
56
+ {"role": "system", "content": EMOTION_STABILITY_MODIFICATION},
57
+ {"role": "user", "content": text_prepared},
58
+ ],
59
+ response_format={"type": "json_object"},
60
+ )
61
+ chatgpt_output = completion.choices[0].message.content
62
+ if chatgpt_output is None:
63
+ raise ValueError(f'received None as openai response content')
64
+
65
+ try:
66
+ output_dict = json.loads(chatgpt_output)
67
+ logger.info(f"TTS text processing succeeded: {output_dict}")
68
+ except json.JSONDecodeError as e:
69
+ logger.exception(f"Error in parsing LLM output: '{chatgpt_output}'")
70
+ raise e
71
+
72
+ output_wrapped = self._wrap_results(output_dict, default_text=text_prepared)
73
+ return output_wrapped
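
A minimal usage sketch for the TTSParamProcessor defined above (assumes OPENAI_API_KEY is configured; the sample text is illustrative):

import asyncio

from src.preprocess_tts_emotions_chain import TTSParamProcessor

async def main():
    processor = TTSParamProcessor()
    params = await processor.run("I CAN'T believe this is happening... Who would expect it??")
    # stability comes from the LLM and is clamped to DEFAULT_TTS_STABILITY_ACCEPTABLE_RANGE
    print(params.voice_settings)

asyncio.run(main())
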
src/prompts.py CHANGED
@@ -1,92 +1,4 @@
1
- class SplitTextPromptV1:
2
- SYSTEM = """\
3
- You are a helpful assistant proficient in literature and language.
4
- Imagine you are helping to prepare the provided text for narration to create the audio book.
5
- We need to understand how many voice actors we need to hire and how to split the text between them.
6
-
7
- Your task is to help with this process, namely:
8
- 1. Identify all book characters occuring in the text, including "narrator".
9
- We will hire individual voice actor for each one of them.
10
- 2. Split the text provided by characters. Let's refer to each split as "part".
11
- Order of parts MUST be the same as in the original text.
12
-
13
- Details:
14
- - First, analyze the whole text to extract the list of characters.
15
- Put found characters to corresponding output field.
16
- - Then, analyze the text top-down and as you proceed fill the "parts" field
17
- - Each part must be attributed to a single character.
18
- Character must belong to the "characters" list
19
- - Use "narrator" character for any descriptive or narrative text,
20
- such as actions ("He shook his head"), narrative parts ("I thought")
21
- thoughts, or descriptions that aren't part of spoken dialogue
22
- - In some books narrator is one of the main characters, having its own name and phrases.
23
- In this case, use regualar character name instead of "narrator" role
24
- - If it's impossible to identify character name from the text provided, use codes "c1", "c2", etc,
25
- where "c" prefix means character and number is used to enumerate unknown characters
26
-
27
- Format your answer as a following JSON:
28
- {{
29
- "characters": [list of unique character names that are found in the text provided],
30
- "parts":
31
- [
32
- {{
33
- "character": <character name>, "text": <the part's text>
34
- }}
35
- ]
36
- }}
37
-
38
- Ensure the order of the parts in the JSON output matches the original order of the text.
39
-
40
- Examples of text split by characters, already in the target format.
41
-
42
- Example 1.
43
- {{
44
- "characters": ["Mr. Gatz", "narrator"],
45
- "parts":
46
- [
47
- {{"character": "Mr. Gatz", "text": "β€œGatz is my name.”"}},
48
- {{"character": "narrator", "text": "β€œβ€”Mr. Gatz. I thought you might want to take the body West.” He shook his head."}},
49
- {{"character": "Mr. Gatz", "text": "β€œJimmy always liked it better down East. He rose up to his position in the East. Were you a friend of my boy’s, Mr.β€”?”"}},
50
- {{"character": "narrator", "text": "β€œWe were close friends.”"}},
51
- {{"character": "Mr. Gatz", "text": "β€œHe had a big future before him, you know. He was only a young man, but he had a lot of brain power here.”"}},
52
- {{"character": "narrator", "text": "He touched his head impressively, and I nodded."}},
53
- {{"character": "Mr. Gatz", "text": "β€œIf he’d of lived, he’d of been a great man. A man like James J. Hill. He’d of helped build up the country.”"}},
54
- {{"character": "narrator", "text": "β€œThat’s true,” I said, uncomfortably."}},
55
- {{"character": "Mr. Gatz", "text": "He fumbled at the embroidered coverlet, trying to take it from the bed, and lay down stifflyβ€”was instantly asleep."}},
56
- ]
57
- }}
58
-
59
- Example 2.
60
- {{
61
- 'characters': [
62
- 'narrator',
63
- 'Mr. Carraway',
64
- 'Daisy',
65
- 'Miss Baker',
66
- 'Tom',
67
- 'Nick'
68
- ],
69
- 'parts': [
70
- {{'character': 'narrator', 'text': 'β€œIf you’ll get up.”'}},
71
- {{'character': 'Mr. Carraway', 'text': 'β€œI will. Good night, Mr. Carraway. See you anon.”'}},
72
- {{'character': 'Daisy', 'text': 'β€œOf course you will,” confirmed Daisy. β€œIn fact I think I’ll arrange a marriage. Come over often, Nick, and I’ll sort ofβ€”ohβ€”fling you together. You knowβ€”lock you up accidentally in linen closets and push you out to sea in a boat, and all that sort of thing—”'}},
73
- {{'character': 'Miss Baker', 'text': 'β€œGood night,” called Miss Baker from the stairs. β€œI haven’t heard a word.”'}},
74
- {{'character': 'Tom', 'text': 'β€œShe’s a nice girl,” said Tom after a moment. β€œThey oughtn’t to let her run around the country this way.”'}},
75
- {{'character': 'Daisy', 'text': 'β€œWho oughtn’t to?” inquired Daisy coldly.'}},
76
- {{'character': 'narrator', 'text': 'β€œHer family.”'}},
77
- {{'character': 'narrator', 'text': 'β€œHer family is one aunt about a thousand years old. Besides, Nick’s going to look after her, aren’t you, Nick? She’s going to spend lots of weekends out here this summer. I think the home influence will be very good for her.”'}},
78
- {{'character': 'narrator', 'text': 'Daisy and Tom looked at each other for a moment in silence.'}}
79
- ]
80
- }}
81
- """
82
-
83
- USER = """\
84
- Here is the book sample:
85
- ---
86
- {text}"""
87
-
88
-
89
- class SplitTextPromptV2:
90
  SYSTEM = """\
91
  you are provided with the book sample.
92
  please rewrite it and insert xml tags indicating character to whom current phrase belongs.
@@ -111,6 +23,36 @@ Here is the book sample:
111
  {text}"""
112
 
113
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
114
  class CharacterVoicePropertiesPrompt:
115
  SYSTEM = """\
116
  You are a helpful assistant proficient in literature and psychology.
@@ -156,3 +98,208 @@ NOTES:
156
  {characters}
157
  </characters>
158
  """
1
+ class SplitTextPrompt:
2
  SYSTEM = """\
3
  you are provided with the book sample.
4
  please rewrite it and insert xml tags indicating character to whom current phrase belongs.
 
23
  {text}"""
24
 
25
 
26
+ class ModifyTextPrompt:
27
+ SYSTEM = """\
28
+ You are provided with the book sample.
29
+ You should help me to make an audiobook with exaggerated emotion-based voice using Text-to-Speech models.
30
+ Your task is to adjust the emotional tone of a given text by modifying the text in the following ways:
31
+ - add special characters: "!" (adds emphasis), "?" (enhances question intonation), "..." (adds pause)
32
+ - write words in uppercase - to add emphasis or convey emotion
33
+
34
+ For example:
35
+ Text: "I can't believe this is happening. Who would expect it?"
36
+ Output text: "I CAN'T believe this is happening... Who would expect it??"
37
+
38
+ Notes:
39
+ - Do not remove or add any words!
40
+ - You are allowed ONLY to add "!", "?", "..." symbols and re-write existing words in uppercase!
41
+ - To add more emotions, you can duplicate exclamation or question marks, for example: "!!!" or "???"
42
+ - DO NOT place "!" or "?" INSIDE existing sentences, since it breaks the sentence in parts
43
+ - Be generous with pauses between sentences or between clearly distinct parts of the same sentence.
44
+ The reason is that the TTS model tends to speak at a fast pace.
45
+ - But don't add too many pauses within one sentence! Add them only where needed.
46
+ - Remember: sentences must sound natural, the way a professional voice actor would read them!
47
+ - DO NOT add pauses at the very end of the given text!
48
+ """
49
+
50
+ USER = """\
51
+ Here is the book sample:
52
+ ---
53
+ {text}"""
54
+
55
+
56
  class CharacterVoicePropertiesPrompt:
57
  SYSTEM = """\
58
  You are a helpful assistant proficient in literature and psychology.
 
98
  {characters}
99
  </characters>
100
  """
101
+
102
+
103
+ class SoundEffectsPrompt:
104
+ SYSTEM = """\
105
+ You are an expert in directing audiobook creation.
106
+ Your task is to design sound effects (by writing their text descriptions) to be laid over the voice actors' narration.
107
+ Sound effect descriptions are going to be passed to a text-to-sound-effect AI model.
108
+ Sound effects must enhance storytelling and evoke an immersive experience in listeners.
109
+
110
+ You are provided with the audiobook text chunk -
111
+ you must insert XML tags containing prompts for AI model describing sound effects.
112
+
113
+ XML effect tags must have following structure:
114
+ <effect prompt="prompt to be passed to text-to-sound-effect AI model">original line from the text</effect>
115
+
116
+ WRITE PROMPTS TO BE VERY RICH IN DETAILS, precisely describing the effect!
117
+ Your prompts MUST BE SPECIFIC, AVOID ABSTRACT sounds like "sound of a cozy room".
118
+
119
+ The generated sound effect will be overlaid on the narration between the opening and the closing effect XML tags.
120
+ Use the position of the tags to control the start time of the effect and its duration.
121
+
122
+ Additional requirements:
123
+ - In the very beginning, analyze the whole text chunk provided in order to understand events and atmosphere.
124
+ - Aim for episodic sound effects, highlighting atmosphere and characters' actions.
125
+ For example, creaking of stairs, wind blowing, car honks, the sound of a falling book, a ticking clock
126
+ - NEVER generate background music
127
+ - NEVER generate ambient sounds, for example people's voices, sound of the crowd
128
+ - NEVER generate sounds for gestures, for example for a hand raised in the air.
129
+ - NEVER generate effects for sounds people may produce: laughing, giggling, sobbing, crying, talking, singing, screaming.
130
+ - NEVER generate silence, since it's a too abstract effect
131
+ - The text-to-sound-effects model is able to generate only short audio files, up to 5 seconds long
132
+ - Aim to position sound effects at the most intuitive points for a natural, immersive experience.
133
+ For example, instead of placing the sound effect only on a single word or object (like "stairs"),
134
+ tag a broader phrase making the effect feel part of the action or dialogue.
135
+ - It's allowed to add no sound effects
136
+
137
+ Examples of bad prompts:
138
+ 1. "brief silence, creating a moment of tension" - it's too short, not specific and is an ambient sound.
139
+ 2. "brief, tense silence, filled with unspoken words and a slight undercurrent of tension" - very abstract, and breaks the rule of not generating silence
140
+ 3. "sudden burst of bright light filling a room, creating a warm and inviting atmosphere" - abstract
141
+ 4. "sudden, commanding gesture of a hand being lifted, creating a brief pause in conversation" - abstract
142
+ 5. "exaggerated, humorous emphasis on the age, suggesting an ancient, creaky presence"
143
+
144
+ Examples of good prompts:
145
+ 1. "soft rustling of paper as a page is turned, delicate and gentle"
146
+ 2. "light thud of a magazine landing on a wooden table, slightly echoing in the quiet room"
147
+ 3. "Old wooden staircase creaking under slow footsteps, each step producing uneven crackles, groans, and occasional sharp snaps, emphasizing age and fragility in a quiet, echoing space" - it's specific and rich in details
148
+
149
+ Respond with the original text, with selected phrases wrapped inside effect XML tags.
150
+ Do not modify the original text!
151
+ Do not include anything else in your answer.
152
+ """
153
+
154
+ USER = """\
155
+ {text}
156
+ """
157
+
158
+
159
+ # TODO: this prompt is not used
160
+ PREFIX = """\
161
+ You should help me to make an audiobook with realistic emotion sound using TTS.
162
+ You are tasked with generating a description of sound effects
163
+ that matches the atmosphere, actions, and tone of a given sentence or text from a book.
164
+ The description should be tailored to create a sound effect using ElevenLabs'sound generation API.
165
+ The generated sound description must evoke the scene
166
+ or emotions from the text (e.g., footsteps, wind, tense silence, etc.),
167
+ and it should be succinct and fit the mood of the text."""
168
+
169
+ # TODO: this prompt is not used
170
+ SOUND_EFFECT_GENERATION = f"""
171
+ {PREFIX}
172
+
173
+ Additionally, you should include the following parameters in your response:
174
+
175
+ Text: A generated description of the sound that matches the text provided.
176
+ Keep the description simple and effective to capture the soundscape.
177
+ This text will be converted into a sound effect.
178
+ Duration_seconds: The appropriate duration of the sound effect,
179
+ which should be calculated based on the length and nature of the scene.
180
+ Cap this duration at 22 seconds. But be careful: for very long input text make a long sound effect,
181
+ and for short input text make a short one. The duration should be similar to the duration of the input text.
182
+ Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation closely
183
+ follow the sound description. For general sound effects (e.g., footsteps, background ambiance),
184
+ use a value around 0.3. For more specific or detailed sound scenes
185
+ (e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.
186
+
187
+ Your output should be in the following JSON format:
188
+
189
+ {{
190
+ "text": "A soft breeze rustling through leaves, distant birds chirping.",
191
+ "duration_seconds": 4.0,
192
+ "prompt_influence": 0.4
193
+ }}
194
+
195
+ NOTES:
196
+ - NEVER add any speech or voices in your instructions!
197
+ - NEVER add any music in your instructions!
198
+ - NEVER add city sounds, car honks in your instructions!
199
+ - make your text descriptions VERY SPECIFIC, AVOID vague instructions.
200
+ If it's necessary, you can use couple sentences to formulate the instruction.
201
+ But remember to use keep instructions simple.
202
+ - aim to create specific sounds, like crackling fireplace, footsteps, wind, etc...
203
+ """
204
+
205
+ # TODO: this prompt is not used
206
+ SOUND_EFFECT_GENERATION_WITHOUT_DURATION_PREDICTION = f"""
207
+ {PREFIX}
208
+
209
+ Additionally, you should include the following parameters in your response:
210
+
211
+ Text: A generated description of the sound that matches the text provided.
212
+ Keep the description simple and effective to capture the soundscape.
213
+ This text will be converted into a sound effect.
214
+ Prompt_influence: A value between 0 and 1, where a higher value makes the sound generation closely
215
+ follow the sound description. For general sound effects (e.g., footsteps, background ambiance),
216
+ use a value around 0.3. For more specific or detailed sound scenes
217
+ (e.g., thunderstorm, battle sounds), use a higher value like 0.5 to 0.7.
218
+
219
+ Your output should be in the following JSON format:
220
+
221
+ {{
222
+ "text": "A soft breeze rustling through leaves, distant birds chirping.",
223
+ "prompt_influence": 0.4
224
+ }}"""
225
+
226
+
227
+ EMOTION_STABILITY_MODIFICATION = """
228
+ You should help me to make an audiobook with exaggerated emotion-based voice using Text-to-Speech.
229
+ Your single task is to select the "stability" TTS parameter value,
230
+ based on the emotional intensity level in the provided text chunk.
231
+
232
+ Provided text was previously modified by uppercasing some words and adding "!", "?", "..." symbols.
233
+ The more uppercase words or "!", "?", "..." symbols there are, the higher the emotional intensity level.
234
+ Higher emotional intensity must be associated with lower values of "stability" parameter,
235
+ and lower emotional intensity must be associated with higher "stability" values.
236
+ Low "stability" makes the TTS generate more expressive, less stable speech, better suited to conveying emotional range.
237
+
238
+ Available range for "stability" values is [0.3; 0.8].
239
+
240
+ You MUST answer with the following JSON,
241
+ containing a SINGLE "stability" parameter with selected value:
242
+ {"stability": float}
243
+ DO NOT INCLUDE ANYTHING ELSE in your response.
244
+
245
+ Example:
246
+ Input: "I CAN'T believe this is happening... Who would expect it??"
247
+ Expected output: {"stability": 0.4}
248
+ """
249
+
250
+ # TODO: this prompt is not used
251
+ TEXT_MODIFICATION_WITH_SSML = """
252
+ You should help me to make an audiobook with overabundant emotion-based voice using TTS.
253
+ You are tasked with transforming the text provided into a sophisticated SSML script
254
+ that is optimized for emotionally, dramatically and breathtaking rich audiobook narration.
255
+ Analyze the text for underlying emotions, detect nuances in intonation, and discern the intended impact.
256
+ Apply suitable SSML enhancements to ensure that the final TTS output delivers
257
+ a powerful, engaging, dramatic and breathtaking listening experience appropriate for an audiobook context
258
+ (more effects/emotions are better than less)."
259
+
260
+ Please, use only provided SSML tags and don't generate any other tags.
261
+ Key SSML Tags to Utilize:
262
+ <speak>: This is the root element. All SSML content to be synthesized must be enclosed within this tag.
263
+ <prosody>: Manipulates pitch, rate, and volume to convey various emotions and emphases. Use this tag to adjust the voice to match the mood and tone of different parts of the narrative.
264
+ <break>: Inserts pauses of specified durations. Use this to create natural breaks in speech, aiding in dramatic effect and better comprehension for listeners.
265
+ <emphasis>: Adds stress to words or phrases to highlight key points or emotions, similar to vocal emphasis in natural speech.
266
+ <p> and <s>: Structural tags that denote paragraphs and sentences, respectively. They help to manage the flow and pacing of the narrative appropriately.
267
+
268
+ Input Text Example: "He stood there, gazing into the endless horizon. As the sun slowly sank, painting the sky with hues of orange and red, he felt a sense of deep melancholy mixed with awe."
269
+
270
+ Modified text should be in the XML format. Expected SSML-enriched Output:
271
+
272
+ <speak>
273
+ <p>
274
+ <s>
275
+ He stood there, <prosody rate="slow" volume="soft">gazing into the endless horizon.</prosody>
276
+ </s>
277
+ <s>
278
+ As the sun slowly <prosody rate="medium" pitch="-2st">sank,</prosody>
279
+ <prosody volume="medium" pitch="+1st">painting the sky with hues of orange and red,</prosody>
280
+ he felt a sense of deep <prosody volume="soft" pitch="-1st">melancholy</prosody> mixed with <emphasis level="moderate">awe.</emphasis>
281
+ </s>
282
+ </p>
283
+ </speak>
284
+
285
+ After modifying the text, adjust the "stability", "similarity_boost" and "style" parameters
286
+ according to the level of emotional intensity in the modified text.
287
+ Higher emotional intensity should lower the "stability" and raise the "similarity_boost".
288
+ Your output should be in the following JSON format:
289
+ {
290
+ "modified_text": "Modified text in xml format with SSML tags.",
291
+ "params": {
292
+ "stability": 0.7,
293
+ "similarity_boost": 0.5,
294
+ "style": 0.3
295
+ }
296
+ }
297
+
298
+ The "stability" parameter should range from 0 to 1,
299
+ with lower values indicating a more expressive, less stable voice.
300
+ The "similarity_boost" parameter should also range from 0 to 1,
301
+ with higher values indicating more emphasis on the voice similarity.
302
+ The "style" parameter should also range from 0 to 1,
303
+ where lower values indicate a neutral tone and higher values reflect more stylized or emotional delivery.
304
+ Adjust both according to the emotional intensity of the text.
305
+ """
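
Purely illustrative and not taken from the repository: the shape of an LLM response that follows SoundEffectsPrompt above, and the JSON expected by EMOTION_STABILITY_MODIFICATION (values are made up).

# hypothetical examples of the response formats the prompts above request
annotated_example = (
    '<effect prompt="old wooden staircase creaking under slow footsteps, '
    'uneven crackles echoing in a quiet hallway">He climbed the stairs</effect> '
    "and paused at the door."
)
stability_example = {"stability": 0.4}
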
src/schemas.py ADDED
@@ -0,0 +1,234 @@
1
+ from __future__ import annotations
2
+
3
+ import base64
4
+ import typing as t
5
+ from enum import StrEnum
6
+
7
+ import pandas as pd
8
+ from elevenlabs import VoiceSettings
9
+ from pydantic import BaseModel, ConfigDict, Field
10
+
11
+ from src import utils
12
+
13
+
14
+ class AudioOutputFormat(StrEnum):
15
+ MP3_22050_32 = "mp3_22050_32"
16
+ MP3_44100_32 = "mp3_44100_32"
17
+ MP3_44100_64 = "mp3_44100_64"
18
+ MP3_44100_96 = "mp3_44100_96"
19
+ MP3_44100_128 = "mp3_44100_128"
20
+ MP3_44100_192 = "mp3_44100_192"
21
+ PCM_16000 = "pcm_16000"
22
+ PCM_22050 = "pcm_22050"
23
+ PCM_24000 = "pcm_24000"
24
+ PCM_44100 = "pcm_44100"
25
+ ULAW_8000 = "ulaw_8000"
26
+
27
+
28
+ class ExtraForbidModel(BaseModel):
29
+ model_config = ConfigDict(extra="forbid")
30
+
31
+
32
+ # use Ellipsis to mark omitted function parameter.
33
+ # cast it to Any type to avoid warnings from type checkers
34
+ # exact same approach is used in elevenlabs client.
35
+ OMIT = t.cast(t.Any, ...)
36
+
37
+
38
+ class TTSParams(ExtraForbidModel):
39
+ # NOTE: pydantic treats Ellipsis as a mark of a required field.
40
+ # in order to set Ellipsis as actual field default value, we need to use workaround
41
+ # and use Field's default_factory
42
+
43
+ voice_id: str
44
+ text: str
45
+ # enable_logging: typing.Optional[bool] = None
46
+
47
+ # NOTE: we opt for quality over speed - thus don't use this param
48
+ # optimize_streaming_latency: typing.Optional[OptimizeStreamingLatency] = None
49
+
50
+ # NOTE: here we set default different from 11labs API
51
+ # output_format: AudioOutputFormat = AudioOutputFormat.MP3_44100_128
52
+ output_format: AudioOutputFormat = AudioOutputFormat.MP3_44100_192
53
+
54
+ # NOTE: pydantic has protected "model_" namespace.
55
+ # here we use workaround to pass "model_id" param to 11labs client
56
+ # via serialization_alias
57
+ audio_model_id: t.Optional[str] = Field(
58
+ default_factory=lambda: OMIT, serialization_alias="model_id"
59
+ )
60
+
61
+ language_code: t.Optional[str] = Field(default_factory=lambda: OMIT)
62
+
63
+ # reference: https://elevenlabs.io/docs/speech-synthesis/voice-settings
64
+ voice_settings: t.Optional[VoiceSettings] = Field(default_factory=lambda: OMIT)
65
+
66
+ # pronunciation_dictionary_locators: t.Optional[
67
+ # t.Sequence[PronunciationDictionaryVersionLocator]
68
+ # ] = Field(default_factory=lambda: OMIT)
69
+ seed: t.Optional[int] = Field(default_factory=lambda: OMIT)
70
+ previous_text: t.Optional[str] = Field(default_factory=lambda: OMIT)
71
+ next_text: t.Optional[str] = Field(default_factory=lambda: OMIT)
72
+ previous_request_ids: t.Optional[t.Sequence[str]] = Field(default_factory=lambda: OMIT)
73
+ next_request_ids: t.Optional[t.Sequence[str]] = Field(default_factory=lambda: OMIT)
74
+ # request_options: t.Optional[RequestOptions] = None
75
+
76
+ def to_dict(self):
77
+ """
78
+ dump the pydantic model in the format required by 11labs api.
79
+
80
+ NOTE: we need to use `by_alias=True` in order to correctly handle
81
+ alias for `audio_model_id` field,
82
+ since model_id belongs to pydantic protected namespace.
83
+
84
+ NOTE: we also ignore all fields with default Ellipsis value,
85
+ since 11labs will assign Ellipses itself,
86
+ and we won't get any warning in logs.
87
+ """
88
+ ellipsis_fields = {field for field, value in self if value is ...}
89
+ res = self.model_dump(by_alias=True, exclude=ellipsis_fields)
90
+ return res
91
+
92
+
93
+ class TTSTimestampsAlignment(ExtraForbidModel):
94
+ characters: list[str]
95
+ character_start_times_seconds: list[float]
96
+ character_end_times_seconds: list[float]
97
+ _text_joined: str
98
+
99
+ def __init__(self, **data):
100
+ super().__init__(**data)
101
+ self._text_joined = "".join(self.characters)
102
+
103
+ @property
104
+ def text_joined(self):
105
+ return self._text_joined
106
+
107
+ def to_dataframe(self):
108
+ return pd.DataFrame(
109
+ {
110
+ "char": self.characters,
111
+ "start": self.character_start_times_seconds,
112
+ "end": self.character_end_times_seconds,
113
+ }
114
+ )
115
+
116
+ @classmethod
117
+ def combine_alignments(
118
+ cls,
119
+ alignments: list[TTSTimestampsAlignment],
120
+ add_placeholders: bool = False,
121
+ pause_bw_chunks_s: float = 0.2,
122
+ ) -> TTSTimestampsAlignment:
123
+ """
124
+ Combine alignments created for different TTS phrases into a single alignment for the whole text.
125
+
126
+ NOTE: while splitting original text into character phrases,
127
+ we ignore separators between phrases.
128
+ They may be different: single or multiple spaces, newlines, etc.
129
+ To account for them, we insert a fixed pause and a placeholder character between phrases in the final alignment.
130
+ This will give us an approximation of the real timestamp mapping
131
+ for voicing the whole original text.
132
+
133
+ NOTE: The quality of such approximation seems appropriate,
134
+ considering the amount of time required to implement more accurate mapping.
135
+ """
136
+
137
+ chars = []
138
+ starts = []
139
+ ends = []
140
+ prev_chunk_end_time = 0.0
141
+ n_alignments = len(alignments)
142
+
143
+ for ix, a in enumerate(alignments):
144
+ cur_starts_absolute = [prev_chunk_end_time + s for s in a.character_start_times_seconds]
145
+ cur_ends_absolute = [prev_chunk_end_time + e for e in a.character_end_times_seconds]
146
+
147
+ chars.extend(a.characters)
148
+ starts.extend(cur_starts_absolute)
149
+ ends.extend(cur_ends_absolute)
150
+
151
+ if ix < n_alignments - 1 and add_placeholders:
152
+ chars.append('#')
153
+ placeholder_start = cur_ends_absolute[-1]
154
+ starts.append(placeholder_start)
155
+ ends.append(placeholder_start + pause_bw_chunks_s)
156
+
157
+ prev_chunk_end_time = ends[-1]
158
+
159
+ return cls(
160
+ characters=chars,
161
+ character_start_times_seconds=starts,
162
+ character_end_times_seconds=ends,
163
+ )
164
+
165
+ def filter_chars_without_duration(self):
166
+ """
167
+ Create new class instance with characters with 0 duration removed.
168
+ Needed to provide correct alignment when overlaying sound effects.
169
+ """
170
+ df = self.to_dataframe()
171
+ mask = (df['start'] - df['end']).abs() > 1e-5
172
+ df = df[mask]
173
+
174
+ res = TTSTimestampsAlignment(
175
+ characters=df['char'].to_list(),
176
+ character_start_times_seconds=df['start'].to_list(),
177
+ character_end_times_seconds=df['end'].to_list(),
178
+ )
179
+
180
+ return res
181
+
182
+ def get_start_time_by_char_ix(self, char_ix: int, safe=True):
183
+ if safe:
184
+ char_ix = utils.get_collection_safe_index(
185
+ ix=char_ix,
186
+ collection=self.character_start_times_seconds,
187
+ )
188
+ return self.character_start_times_seconds[char_ix]
189
+
190
+ def get_end_time_by_char_ix(self, char_ix: int, safe=True):
191
+ if safe:
192
+ char_ix = utils.get_collection_safe_index(
193
+ ix=char_ix,
194
+ collection=self.character_end_times_seconds,
195
+ )
196
+ return self.character_end_times_seconds[char_ix]
197
+
198
+
199
+ class TTSTimestampsResponse(ExtraForbidModel):
200
+ audio_base64: str
201
+ alignment: TTSTimestampsAlignment
202
+ normalized_alignment: TTSTimestampsAlignment
203
+
204
+ @property
205
+ def audio_bytes(self):
206
+ return base64.b64decode(self.audio_base64)
207
+
208
+ def write_audio_to_file(self, filepath_no_ext: str, audio_format: AudioOutputFormat) -> str:
209
+ if audio_format.startswith("pcm_"):
210
+ sr = int(audio_format.removeprefix("pcm_"))
211
+ fp = f"{filepath_no_ext}.wav"
212
+ utils.write_raw_pcm_to_file(
213
+ data=self.audio_bytes,
214
+ fp=fp,
215
+ n_channels=1, # seems like it's 1 channel always
216
+ bytes_depth=2, # seems like it's 2 bytes always
217
+ sampling_rate=sr,
218
+ )
219
+ return fp
220
+ elif audio_format.startswith("mp3_"):
221
+ fp = f"{filepath_no_ext}.mp3"
222
+ # received mp3 seems to already contain all required metadata
223
+ # like sampling rate
224
+ # and sample width
225
+ utils.write_bytes(data=self.audio_bytes, fp=fp)
226
+ return fp
227
+ else:
228
+ raise ValueError(f"don't know how to write audio format: {audio_format}")
229
+
230
+
231
+ class SoundEffectsParams(ExtraForbidModel):
232
+ text: str
233
+ duration_seconds: float | None
234
+ prompt_influence: float | None
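
A small sketch (hypothetical timings) of how TTSTimestampsAlignment.combine_alignments above merges per-phrase alignments, inserting a '#' placeholder and a 0.2 s pause between chunks:

from src.schemas import TTSTimestampsAlignment

a1 = TTSTimestampsAlignment(
    characters=["H", "i"],
    character_start_times_seconds=[0.0, 0.1],
    character_end_times_seconds=[0.1, 0.2],
)
a2 = TTSTimestampsAlignment(
    characters=["y", "o"],
    character_start_times_seconds=[0.0, 0.1],
    character_end_times_seconds=[0.1, 0.2],
)
combined = TTSTimestampsAlignment.combine_alignments([a1, a2], add_placeholders=True)
print(combined.characters)                   # ['H', 'i', '#', 'y', 'o']
print(combined.character_end_times_seconds)  # [0.1, 0.2, 0.4, 0.5, 0.6]
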
src/select_voice_chain.py CHANGED
@@ -10,10 +10,9 @@ from langchain_core.prompts import (
10
  from langchain_core.runnables import RunnablePassthrough
11
  from pydantic import BaseModel
12
 
13
- from src.config import logger
14
  from src.prompts import CharacterVoicePropertiesPrompt
15
  from src.utils import GPTModels, get_chat_llm
16
- from src.config import VOICES_CSV_FP
17
 
18
 
19
  class Property(StrEnum):
@@ -121,9 +120,7 @@ class VoiceSelector:
121
 
122
  return character2voice
123
 
124
- def _remove_hallucinations_single_character(
125
- self, character_props: CharacterProperties
126
- ):
127
  def _process_prop(prop: Property, value: str):
128
  if value not in self.PROPERTY_VALUES[prop]:
129
  logger.warning(
@@ -134,9 +131,7 @@ class VoiceSelector:
134
 
135
  return CharacterPropertiesNullable(
136
  gender=_process_prop(prop=Property.gender, value=character_props.gender),
137
- age_group=_process_prop(
138
- prop=Property.age_group, value=character_props.age_group
139
- ),
140
  )
141
 
142
  def remove_hallucinations(
@@ -167,28 +162,20 @@ class VoiceSelector:
167
 
168
  prompt = ChatPromptTemplate.from_messages(
169
  [
170
- SystemMessagePromptTemplate.from_template(
171
- CharacterVoicePropertiesPrompt.SYSTEM
172
- ),
173
- HumanMessagePromptTemplate.from_template(
174
- CharacterVoicePropertiesPrompt.USER
175
- ),
176
  ]
177
  )
178
  prompt = prompt.partial(
179
  **{
180
  "available_genders": self.get_available_properties_str(Property.gender),
181
- "available_age_groups": self.get_available_properties_str(
182
- Property.age_group
183
- ),
184
  "format_instructions": format_instructions,
185
  }
186
  )
187
 
188
  chain = (
189
- RunnablePassthrough.assign(
190
- charater_props=prompt | llm | self.remove_hallucinations
191
- )
192
  | RunnablePassthrough.assign(character2voice=self.get_voices)
193
  | self.pack_results
194
  )
 
10
  from langchain_core.runnables import RunnablePassthrough
11
  from pydantic import BaseModel
12
 
13
+ from src.config import VOICES_CSV_FP, logger
14
  from src.prompts import CharacterVoicePropertiesPrompt
15
  from src.utils import GPTModels, get_chat_llm
 
16
 
17
 
18
  class Property(StrEnum):
 
120
 
121
  return character2voice
122
 
123
+ def _remove_hallucinations_single_character(self, character_props: CharacterProperties):
 
 
124
  def _process_prop(prop: Property, value: str):
125
  if value not in self.PROPERTY_VALUES[prop]:
126
  logger.warning(
 
131
 
132
  return CharacterPropertiesNullable(
133
  gender=_process_prop(prop=Property.gender, value=character_props.gender),
134
+ age_group=_process_prop(prop=Property.age_group, value=character_props.age_group),
 
 
135
  )
136
 
137
  def remove_hallucinations(
 
162
 
163
  prompt = ChatPromptTemplate.from_messages(
164
  [
165
+ SystemMessagePromptTemplate.from_template(CharacterVoicePropertiesPrompt.SYSTEM),
166
+ HumanMessagePromptTemplate.from_template(CharacterVoicePropertiesPrompt.USER),
 
 
 
 
167
  ]
168
  )
169
  prompt = prompt.partial(
170
  **{
171
  "available_genders": self.get_available_properties_str(Property.gender),
172
+ "available_age_groups": self.get_available_properties_str(Property.age_group),
 
 
173
  "format_instructions": format_instructions,
174
  }
175
  )
176
 
177
  chain = (
178
+ RunnablePassthrough.assign(charater_props=prompt | llm | self.remove_hallucinations)
 
 
179
  | RunnablePassthrough.assign(character2voice=self.get_voices)
180
  | self.pack_results
181
  )
src/sound_effects_design.py ADDED
@@ -0,0 +1,99 @@
1
+ import re
2
+
3
+ from langchain_core.output_parsers import StrOutputParser
4
+ from langchain_core.prompts import (
5
+ ChatPromptTemplate,
6
+ HumanMessagePromptTemplate,
7
+ SystemMessagePromptTemplate,
8
+ )
9
+ from langchain_core.runnables import RunnablePassthrough
10
+ from pydantic import BaseModel
11
+
12
+ from src import prompts
13
+ from src.utils import GPTModels, get_chat_llm
14
+
15
+
16
+ class SoundEffectDescription(BaseModel):
17
+ prompt: str
18
+ text_between_tags: str
19
+ # indices relative to LLM response
20
+ ix_start_llm_response: int
21
+ ix_end_llm_response: int
22
+ # indices relative to original text passed to LLM
23
+ ix_start_orig_text: int
24
+ ix_end_orig_text: int
25
+ # NOTE: start_sec and duration_sec fields
26
+ # are going to be filled once TTS audio is generated
27
+ start_sec: float = -1.0
28
+ duration_sec: float = -1.0
29
+
30
+
31
+ class SoundEffectsDesignOutput(BaseModel):
32
+ text_raw: str
33
+ text_annotated: str
34
+ _sound_effects_descriptions: list[SoundEffectDescription]
35
+
36
+ @staticmethod
37
+ def _parse_effects_xml_tags(text) -> list[SoundEffectDescription]:
38
+ """
39
+ We rely on the LLM to format the response correctly
40
+ and currently don't try to fix possible errors.
41
+ """
42
+ # TODO: allow to open-close tags
43
+ # <effect prompt=\"(.*?)\" duration=\"(.*)\"/>
44
+
45
+ pattern = re.compile(r"<effect prompt=(?:\"|')(.*?)(?:\"|')>(.*?)</effect>")
46
+ all_matches = list(pattern.finditer(text))
47
+
48
+ sound_effects_descriptions = []
49
+
50
+ rm_chars_running_total = 0
51
+ for m in all_matches:
52
+ mstart, mend = m.span()
53
+ prompt = m.group(1)
54
+ text_between_tags = m.group(2)
55
+
56
+ ix_start_orig = mstart - rm_chars_running_total
57
+ ix_end_orig = ix_start_orig + len(text_between_tags)
58
+
59
+ sound_effects_descriptions.append(
60
+ SoundEffectDescription(
61
+ prompt=prompt,
62
+ text_between_tags=text_between_tags,
63
+ ix_start_llm_response=mstart,
64
+ ix_end_llm_response=mend,
65
+ ix_start_orig_text=ix_start_orig,
66
+ ix_end_orig_text=ix_end_orig,
67
+ )
68
+ )
69
+
70
+ mlen = mend - mstart
71
+ rm_chars_running_total += mlen - len(text_between_tags)
72
+
73
+ return sound_effects_descriptions
74
+
75
+ def __init__(self, **data):
76
+ super().__init__(**data)
77
+ self._sound_effects_descriptions = self._parse_effects_xml_tags(self.text_annotated)
78
+
79
+ @property
80
+ def sound_effects_descriptions(self) -> list[SoundEffectDescription]:
81
+ return self._sound_effects_descriptions
82
+
83
+
84
+ def create_sound_effects_design_chain(llm_model: GPTModels):
85
+ llm = get_chat_llm(llm_model=llm_model, temperature=0.0)
86
+
87
+ prompt = ChatPromptTemplate.from_messages(
88
+ [
89
+ SystemMessagePromptTemplate.from_template(prompts.SoundEffectsPrompt.SYSTEM),
90
+ HumanMessagePromptTemplate.from_template(prompts.SoundEffectsPrompt.USER),
91
+ ]
92
+ )
93
+
94
+ chain = RunnablePassthrough.assign(text_annotated=prompt | llm | StrOutputParser()) | (
95
+ lambda inputs: SoundEffectsDesignOutput(
96
+ text_raw=inputs["text"], text_annotated=inputs["text_annotated"]
97
+ )
98
+ )
99
+ return chain
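
A usage sketch for the tag parser above (the annotated string is made up); it shows how the recovered indices map back into the original, tag-free text:

from src.sound_effects_design import SoundEffectsDesignOutput

raw = "The stairs creaked as he climbed."
annotated = (
    'The <effect prompt="old wooden staircase creaking under slow footsteps">'
    "stairs creaked</effect> as he climbed."
)
out = SoundEffectsDesignOutput(text_raw=raw, text_annotated=annotated)
desc = out.sound_effects_descriptions[0]
# indices point back into the original text, with the tag characters discounted
print(raw[desc.ix_start_orig_text:desc.ix_end_orig_text])  # "stairs creaked"
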
src/text_modification_chain.py ADDED
@@ -0,0 +1,34 @@
1
+ from langchain.prompts import (
2
+ ChatPromptTemplate,
3
+ HumanMessagePromptTemplate,
4
+ SystemMessagePromptTemplate,
5
+ )
6
+ from langchain_core.output_parsers import StrOutputParser
7
+ from langchain_core.runnables import RunnablePassthrough
8
+ from pydantic import BaseModel
9
+
10
+ from src.prompts import ModifyTextPrompt
11
+ from src.utils import GPTModels, get_chat_llm
12
+
13
+
14
+ class ModifiedTextOutput(BaseModel):
15
+ text_raw: str
16
+ text_modified: str
17
+
18
+
19
+ def modify_text_chain(llm_model: GPTModels):
20
+ llm = get_chat_llm(llm_model=llm_model, temperature=0.0)
21
+
22
+ prompt = ChatPromptTemplate.from_messages(
23
+ [
24
+ SystemMessagePromptTemplate.from_template(ModifyTextPrompt.SYSTEM),
25
+ HumanMessagePromptTemplate.from_template(ModifyTextPrompt.USER),
26
+ ]
27
+ )
28
+
29
+ chain = RunnablePassthrough.assign(text_modified=prompt | llm | StrOutputParser()) | (
30
+ lambda inputs: ModifiedTextOutput(
31
+ text_raw=inputs["text"], text_modified=inputs["text_modified"]
32
+ )
33
+ )
34
+ return chain
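
A minimal invocation sketch for the chain above (assumes OPENAI_API_KEY is configured; the input text is illustrative):

import asyncio

from src.text_modification_chain import modify_text_chain
from src.utils import GPTModels

async def main():
    chain = modify_text_chain(llm_model=GPTModels.GPT_4o)
    out = await chain.ainvoke({"text": "I can't believe this is happening."})
    print(out.text_modified)

asyncio.run(main())
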
src/text_split_chain.py CHANGED
@@ -9,7 +9,7 @@ from langchain_core.prompts import (
9
  from langchain_core.runnables import RunnablePassthrough
10
  from pydantic import BaseModel
11
 
12
- from src.prompts import SplitTextPromptV1, SplitTextPromptV2
13
  from src.utils import GPTModels, get_chat_llm
14
 
15
 
@@ -63,66 +63,14 @@ def create_split_text_chain(llm_model: GPTModels):
63
 
64
  prompt = ChatPromptTemplate.from_messages(
65
  [
66
- SystemMessagePromptTemplate.from_template(SplitTextPromptV2.SYSTEM),
67
- HumanMessagePromptTemplate.from_template(SplitTextPromptV2.USER),
68
  ]
69
  )
70
 
71
- chain = RunnablePassthrough.assign(
72
- text_annotated=prompt | llm | StrOutputParser()
73
- ) | (
74
  lambda inputs: SplitTextOutput(
75
  text_raw=inputs["text"], text_annotated=inputs["text_annotated"]
76
  )
77
  )
78
  return chain
79
-
80
-
81
- ###### old code ######
82
-
83
-
84
- class CharacterAnnotatedText(BaseModel):
85
- phrases: list[CharacterPhrase]
86
- _characters: list[str]
87
-
88
- def __init__(self, **data):
89
- super().__init__(**data)
90
- self._characters = list(set(phrase.character for phrase in self.phrases))
91
-
92
- @property
93
- def characters(self):
94
- return self._characters
95
-
96
- def to_pretty_text(self):
97
- lines = []
98
- lines.append(f"characters: {self.characters}")
99
- lines.append("-" * 20)
100
- lines.extend(f"[{phrase.character}] {phrase.text}" for phrase in self.phrases)
101
- res = "\n".join(lines)
102
- return res
103
-
104
-
105
- class SplitTextOutputOld(BaseModel):
106
- characters: list[str]
107
- parts: list[CharacterPhrase]
108
-
109
- def to_character_annotated_text(self):
110
- return CharacterAnnotatedText(phrases=self.parts)
111
-
112
-
113
- def create_split_text_chain_old(llm_model: GPTModels):
114
- llm = get_chat_llm(llm_model=llm_model, temperature=0.0)
115
- llm = llm.with_structured_output(SplitTextOutputOld, method="json_mode")
116
-
117
- prompt = ChatPromptTemplate.from_messages(
118
- [
119
- SystemMessagePromptTemplate.from_template(SplitTextPromptV1.SYSTEM),
120
- HumanMessagePromptTemplate.from_template(SplitTextPromptV1.USER),
121
- ]
122
- )
123
-
124
- chain = prompt | llm
125
- return chain
126
-
127
-
128
- ## end of old code ##
 
9
  from langchain_core.runnables import RunnablePassthrough
10
  from pydantic import BaseModel
11
 
12
+ from src.prompts import SplitTextPrompt
13
  from src.utils import GPTModels, get_chat_llm
14
 
15
 
 
63
 
64
  prompt = ChatPromptTemplate.from_messages(
65
  [
66
+ SystemMessagePromptTemplate.from_template(SplitTextPrompt.SYSTEM),
67
+ HumanMessagePromptTemplate.from_template(SplitTextPrompt.USER),
68
  ]
69
  )
70
 
71
+ chain = RunnablePassthrough.assign(text_annotated=prompt | llm | StrOutputParser()) | (
 
 
72
  lambda inputs: SplitTextOutput(
73
  text_raw=inputs["text"], text_annotated=inputs["text_annotated"]
74
  )
75
  )
76
  return chain
src/tts.py CHANGED
@@ -1,32 +1,19 @@
1
  import typing as t
 
2
 
3
  from dotenv import load_dotenv
4
- from elevenlabs.client import AsyncElevenLabs, ElevenLabs
5
  from elevenlabs import VoiceSettings
 
6
 
7
  load_dotenv()
8
 
9
- from src.config import logger, ELEVENLABS_API_KEY
 
10
  from src.utils import auto_retry
11
 
12
- ELEVEN_CLIENT = ElevenLabs(api_key=ELEVENLABS_API_KEY)
13
-
14
  ELEVEN_CLIENT_ASYNC = AsyncElevenLabs(api_key=ELEVENLABS_API_KEY)
15
 
16
 
17
- def tts_stream(voice_id: str, text: str) -> t.Iterator[bytes]:
18
- async_iter = ELEVEN_CLIENT.text_to_speech.convert(voice_id=voice_id, text=text)
19
- for chunk in async_iter:
20
- if chunk:
21
- yield chunk
22
-
23
-
24
- def tts(voice_id: str, text: str):
25
- tts_iter = tts_stream(voice_id=voice_id, text=text)
26
- combined = b"".join(tts_iter)
27
- return combined
28
-
29
-
30
  async def tts_astream(
31
  voice_id: str, text: str, params: dict | None = None
32
  ) -> t.AsyncIterator[bytes]:
@@ -50,26 +37,47 @@ async def tts_astream(
50
 
51
 
52
  @auto_retry
53
- async def tts_astream_consumed(
54
- voice_id: str, text: str, params: dict | None = None
55
- ) -> list[bytes]:
56
  aiterator = tts_astream(voice_id=voice_id, text=text, params=params)
57
  return [x async for x in aiterator]
58
 
59
 
60
- async def sound_generation_astream(
61
- sound_generation_data: dict,
62
- ) -> t.AsyncIterator[bytes]:
63
- text = sound_generation_data.pop("text")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
64
  logger.info(
65
- f"request to 11labs sound effect generation with params {sound_generation_data} "
66
- f'for the following text: "{text}"'
67
  )
68
 
69
  async_iter = ELEVEN_CLIENT_ASYNC.text_to_sound_effects.convert(
70
- text=text,
71
- duration_seconds=sound_generation_data["duration_seconds"],
72
- prompt_influence=sound_generation_data["prompt_influence"],
73
  )
74
  async for chunk in async_iter:
75
  if chunk:
@@ -77,6 +85,6 @@ async def sound_generation_astream(
77
 
78
 
79
  @auto_retry
80
- async def sound_generation_consumed(sound_generation_data: dict):
81
- aiterator = sound_generation_astream(sound_generation_data=sound_generation_data)
82
  return [x async for x in aiterator]
 
1
  import typing as t
2
+ from copy import deepcopy
3
 
4
  from dotenv import load_dotenv
 
5
  from elevenlabs import VoiceSettings
6
+ from elevenlabs.client import AsyncElevenLabs
7
 
8
  load_dotenv()
9
 
10
+ from src.config import ELEVENLABS_API_KEY, logger
11
+ from src.schemas import SoundEffectsParams, TTSParams, TTSTimestampsResponse
12
  from src.utils import auto_retry
13
 
 
 
14
  ELEVEN_CLIENT_ASYNC = AsyncElevenLabs(api_key=ELEVENLABS_API_KEY)
15
 
16
 
 
 
 
 
 
 
 
 
 
 
 
 
 
17
  async def tts_astream(
18
  voice_id: str, text: str, params: dict | None = None
19
  ) -> t.AsyncIterator[bytes]:
 
37
 
38
 
39
  @auto_retry
40
+ async def tts_astream_consumed(voice_id: str, text: str, params: dict | None = None) -> list[bytes]:
 
 
41
  aiterator = tts_astream(voice_id=voice_id, text=text, params=params)
42
  return [x async for x in aiterator]
43
 
44
 
45
+ @auto_retry
46
+ async def tts_w_timestamps(params: TTSParams) -> TTSTimestampsResponse:
47
+ async def _tts_w_timestamps(params: TTSParams) -> TTSTimestampsResponse:
48
+ # NOTE: we need to use special `to_dict()` method to ensure pydantic model is converted
49
+ # to dict with proper aliases
50
+ params_dict = params.to_dict()
51
+
52
+ params_no_text = deepcopy(params_dict)
53
+ text = params_no_text.pop('text')
54
+ logger.info(
55
+ f"request to 11labs TTS endpoint with params {params_no_text} "
56
+ f'for the following text: "{text}"'
57
+ )
58
+
59
+ response_raw = await ELEVEN_CLIENT_ASYNC.text_to_speech.convert_with_timestamps(
60
+ **params_dict
61
+ )
62
+
63
+ response_parsed = TTSTimestampsResponse.model_validate(response_raw)
64
+ return response_parsed
65
+
66
+ res = await _tts_w_timestamps(params=params)
67
+ return res
68
+
69
+
70
+ async def sound_generation_astream(params: SoundEffectsParams) -> t.AsyncIterator[bytes]:
71
+ params_no_text = params.model_dump(exclude={"text"})
72
  logger.info(
73
+ f"request to 11labs sound effect generation with params {params_no_text} "
74
+ f'for the following text: "{params.text}"'
75
  )
76
 
77
  async_iter = ELEVEN_CLIENT_ASYNC.text_to_sound_effects.convert(
78
+ text=params.text,
79
+ duration_seconds=params.duration_seconds,
80
+ prompt_influence=params.prompt_influence,
81
  )
82
  async for chunk in async_iter:
83
  if chunk:
 
85
 
86
 
87
  @auto_retry
88
+ async def sound_generation_consumed(params: SoundEffectsParams):
89
+ aiterator = sound_generation_astream(params=params)
90
  return [x async for x in aiterator]
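
A minimal usage sketch of the new async helpers above (illustrative only): the `src/tts.py` module path and the `voice_id` field on `TTSParams` are assumptions, since the diff only shows `text` on `TTSParams` and `text`, `duration_seconds`, `prompt_influence` on `SoundEffectsParams`.

import asyncio

from src.schemas import SoundEffectsParams, TTSParams
from src.tts import sound_generation_consumed, tts_w_timestamps  # module path assumed


async def demo():
    # TTSParams is assumed to carry at least the voice id and the text to synthesize.
    tts_params = TTSParams(voice_id="some-voice-id", text="Hello there!")
    tts_response = await tts_w_timestamps(params=tts_params)  # TTSTimestampsResponse

    effect_params = SoundEffectsParams(
        text="rain on a tin roof, distant thunder",
        duration_seconds=3.0,
        prompt_influence=0.5,
    )
    effect_chunks = await sound_generation_consumed(params=effect_params)
    effect_bytes = b"".join(effect_chunks)
    return tts_response, effect_bytes


asyncio.run(demo())
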
src/utils.py CHANGED
@@ -1,12 +1,20 @@
1
  from enum import StrEnum
 
2
 
 
3
  from httpx import Timeout
4
  from langchain_openai import ChatOpenAI
5
- from tenacity import (
6
- retry,
7
- stop_after_attempt,
8
- wait_random_exponential,
9
- )
10
 
11
 
12
  class GPTModels(StrEnum):
@@ -17,18 +25,148 @@ class GPTModels(StrEnum):
17
 
18
  def get_chat_llm(llm_model: GPTModels, temperature=0.0):
19
  llm = ChatOpenAI(
20
- model=llm_model, temperature=temperature, timeout=Timeout(60, connect=4)
 
 
21
  )
22
  return llm
23
 
24
 
25
  async def consume_aiter(aiterator):
26
  return [x async for x in aiterator]
27
 
28
 
29
  def auto_retry(f):
30
  decorator = retry(
31
- wait=wait_random_exponential(min=2, max=6),
32
- stop=stop_after_attempt(10),
33
  )
34
  return decorator(f)
1
+ import datetime
2
+ import json
3
+ import re
4
+ import shutil
5
+ import typing as t
6
+ import wave
7
+ from collections.abc import Sized
8
  from enum import StrEnum
9
+ from pathlib import Path
10
 
11
+ import pandas as pd
12
  from httpx import Timeout
13
  from langchain_openai import ChatOpenAI
14
+ from pydub import AudioSegment
15
+ from tenacity import retry, stop_after_attempt, wait_random_exponential
16
+
17
+ from src.config import logger, VOICES_CSV_FP
 
18
 
19
 
20
  class GPTModels(StrEnum):
 
25
 
26
  def get_chat_llm(llm_model: GPTModels, temperature=0.0):
27
  llm = ChatOpenAI(
28
+ model=llm_model,
29
+ temperature=temperature,
30
+ timeout=Timeout(60, connect=4),
31
  )
32
  return llm
33
 
34
 
35
+ def get_collection_safe_index(ix: int, collection: Sized):
36
+ res = min(ix, len(collection) - 1)
37
+ res = max(0, res)
38
+ return res
39
+
40
+
41
+ def write_txt(txt: str, fp: str):
42
+ with open(fp, 'w', encoding='utf-8') as fout:
43
+ fout.write(txt)
44
+
45
+
46
+ def write_json(data, fp: str, indent=2):
47
+ with open(fp, 'w', encoding='utf-8') as fout:
48
+ json.dump(data, fout, indent=indent, ensure_ascii=False)
49
+
50
+
51
+ def rm_dir_conditional(dp: str, to_remove=True):
52
+ if not to_remove:
53
+ return
54
+ logger.info(f'removing dir: "{dp}"')
55
+ try:
56
+ shutil.rmtree(dp)
57
+ except Exception:
58
+ logger.exception(f'failed to remove dir')
59
+
60
+
61
+ def get_utc_now_str():
62
+ now = datetime.datetime.now(tz=datetime.UTC)
63
+ now_str = now.strftime('%Y%m%d-%H%M%S')
64
+ return now_str
65
+
66
+
67
  async def consume_aiter(aiterator):
68
  return [x async for x in aiterator]
69
 
70
 
71
  def auto_retry(f):
72
  decorator = retry(
73
+ wait=wait_random_exponential(min=3, max=10),
74
+ stop=stop_after_attempt(20),
75
  )
76
  return decorator(f)
77
+
78
+
79
+ def write_bytes(data: bytes, fp: str):
80
+ logger.info(f'saving to: "{fp}"')
81
+ with open(fp, "wb") as fout:
82
+ fout.write(data)
83
+
84
+
85
+ def write_chunked_bytes(data: t.Iterable[bytes], fp: str):
86
+ logger.info(f'saving to: "{fp}"')
87
+ with open(fp, "wb") as fout:
88
+ for chunk in data:
89
+ if chunk:
90
+ fout.write(chunk)
91
+
92
+
93
+ def write_raw_pcm_to_file(data: bytes, fp: str, n_channels: int, bytes_depth: int, sampling_rate):
94
+ logger.info(f'saving to: "{fp}"')
95
+ with wave.open(fp, "wb") as f:
96
+ f.setnchannels(n_channels)
97
+ f.setsampwidth(bytes_depth)
98
+ f.setframerate(sampling_rate)
99
+ f.writeframes(data)
100
+
101
+
102
+ def get_audio_duration(filepath: str) -> float:
103
+ """
104
+ Returns the duration of the audio file in seconds.
105
+
106
+ :param filepath: Path to the audio file.
107
+ :return: Duration of the audio file in seconds.
108
+ """
109
+ audio = AudioSegment.from_file(filepath)
110
+ # Convert milliseconds to seconds
111
+ duration_in_seconds = len(audio) / 1000
112
+ return round(duration_in_seconds, 1)
113
+
114
+
115
+ def normalize_audio(audio_segment: AudioSegment, target_dBFS: float = -20.0) -> AudioSegment:
116
+ """Normalize an audio segment to the target dBFS level."""
117
+ delta = target_dBFS - audio_segment.dBFS
118
+ res = audio_segment.apply_gain(delta)
119
+ return res
120
+
121
+
122
+ def overlay_multiple_audio(
123
+ main_audio_fp: str,
124
+ audios_to_overlay_fps: list[str],
125
+ starts_sec: list[float], # list of start positions, in seconds
126
+ out_fp: str,
127
+ ):
128
+ main_audio = AudioSegment.from_file(main_audio_fp)
129
+ for fp, cur_start_sec in zip(audios_to_overlay_fps, starts_sec):
130
+ audio_to_overlay = AudioSegment.from_file(fp)
131
+ # NOTE: quote from the documentation:
132
+ # "The result is always the same length as this AudioSegment"
133
+ # reference: https://github.com/jiaaro/pydub/blob/master/API.markdown#audiosegmentoverlay
134
+ # NOTE: `position` params is offset time in milliseconds
135
+ start_ms = int(cur_start_sec * 1000)
136
+ main_audio = main_audio.overlay(audio_to_overlay, position=start_ms)
137
+
138
+ logger.info(f'saving overlayed audio to: "{out_fp}"')
139
+ main_audio.export(out_fp, format='wav')
140
+
141
+
142
+ def get_audio_from_voice_id(voice_id: str) -> str:
143
+ voices_df = pd.read_csv(VOICES_CSV_FP)
144
+ data = voices_df[voices_df["voice_id"] == voice_id]["preview_url"].values[0]
145
+ return data
146
+
147
+
148
+ def get_character_color(character: str) -> str:
149
+ if not character or character == "Unassigned":
150
+ return "#808080"
151
+ colors = [
152
+ "#FF6B6B", # pale red
153
+ "#ed1262", # magenta-red
154
+ "#ed2bac", # magenta
155
+ "#892ed5", # purple
156
+ "#4562f7", # blue
157
+ "#11ab99", # cyan
158
+ "#58f23a", # green
159
+ # "#96CEB4", # light green
160
+ # "#D4A5A5", # light red
161
+ ]
162
+ hash_val = sum(ord(c) for c in character)
163
+ return colors[hash_val % len(colors)]
164
+
165
+
166
+ def prettify_unknown_character_label(text):
167
+ return re.sub(r'\bc(\d+)\b', r'Character\1', text)
168
+
169
+
170
+ def hex_to_rgb(hex_color):
171
+ hex_color = hex_color.lstrip('#')
172
+ return f"{int(hex_color[0:2], 16)},{int(hex_color[2:4], 16)},{int(hex_color[4:6], 16)}"
src/web/constructor.py ADDED
@@ -0,0 +1,45 @@
1
+ from src.web.utils import create_status_html
2
+
3
+
4
+ class HTMLGenerator:
5
+ @staticmethod
6
+ def generate_error(text: str) -> str:
7
+ return create_status_html("Error", [], error_text=text)
8
+
9
+ @staticmethod
10
+ def generate_status(stage_title: str, steps: list[tuple[str, bool]]) -> str:
11
+ return create_status_html(stage_title, steps) + "</div>"
12
+
13
+ @staticmethod
14
+ def generate_text_split(text_split_html: str) -> str:
15
+ return f'''
16
+ <div class="section" style="background-color: #31395294; padding: 1rem; border-radius: 8px; margin-top: 1rem; color: #e0e0e0;">
17
+ <h3 style="color: rgb(224, 224, 224); font-size: 1.5em; margin-bottom: 1rem;">Text Split by Character:</h3>
18
+ {text_split_html}
19
+ </div>
20
+ '''
21
+
22
+ @staticmethod
23
+ def generate_voice_assignments(voice_assignments_html: str) -> str:
24
+ return f'''
25
+ <div class="section" style="background-color: #31395294; padding: 1rem; border-radius: 8px; margin-top: 1rem; color: #e0e0e0;">
26
+ <h3 style="color: rgb(224, 224, 224); font-size: 1.5em; margin-bottom: 1rem;">Voice Assignments:</h3>
27
+ {voice_assignments_html}
28
+ </div>
29
+ '''
30
+
31
+ @staticmethod
32
+ def generate_message_without_voice_id() -> str:
33
+ return '''
34
+ <div class="audiobook-ready" style="background-color: #31395294; padding: 1rem; border-radius: 8px; margin-top: 1rem; text-align: center;">
35
+ <h3 style="color: rgb(224, 224, 224); font-size: 1.5em; margin-bottom: 1rem;">🫀 First, you need to add your voice</h3>
36
+ </div>
37
+ '''
38
+
39
+ @staticmethod
40
+ def generate_final_message() -> str:
41
+ return '''
42
+ <div class="audiobook-ready" style="background-color: #31395294; padding: 1rem; border-radius: 8px; margin-top: 1rem; text-align: center;">
43
+ <h3 style="color: rgb(224, 224, 224); font-size: 1.5em; margin-bottom: 1rem;">πŸŽ‰ Your audiobook is ready!</h3>
44
+ </div>
45
+ '''
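
A small sketch of how `HTMLGenerator` might be driven from the Gradio callbacks; the step labels are illustrative and mirror the ones used in the UI description.

from src.web.constructor import HTMLGenerator

steps = [
    ("Split text into characters", True),
    ("Select voice for each character", True),
    ("Generate audiobook using Text-to-Speech model", False),
]

status_html = HTMLGenerator.generate_status("Generating audio", steps)
error_html = HTMLGenerator.generate_error("ELEVENLABS_API_KEY is not set")
done_html = HTMLGenerator.generate_final_message()
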
src/web/utils.py ADDED
@@ -0,0 +1,345 @@
1
+ from src.sound_effects_design import SoundEffectDescription
2
+ from src.text_split_chain import CharacterPhrase
3
+ from src.utils import (
4
+ get_audio_from_voice_id,
5
+ get_character_color,
6
+ get_collection_safe_index,
7
+ hex_to_rgb,
8
+ prettify_unknown_character_label,
9
+ )
10
+ from src.web.variables import EFFECT_CSS
11
+
12
+
13
+ def create_status_html(status: str, steps: list[tuple[str, bool]], error_text: str = '') -> str:
14
+ # CSS for the spinner animation
15
+ spinner_css = """
16
+ @keyframes spin {
17
+ 0% { transform: rotate(0deg); }
18
+ 100% { transform: rotate(360deg); }
19
+ }
20
+ .spinner {
21
+ width: 20px;
22
+ height: 20px;
23
+ border: 3px solid #e0e0e0;
24
+ border-top: 3px solid #3498db;
25
+ border-radius: 50%;
26
+ animation: spin 1s linear infinite;
27
+ display: inline-block;
28
+ }
29
+ """
30
+
31
+ steps_html = "\n".join(
32
+ [
33
+ f'<div class="step-item" style="display: flex; align-items: center; padding: 0.8rem; margin-bottom: 0.5rem; background-color: #31395294; border-radius: 6px; font-weight: 600;">'
34
+ f'<span class="step-icon" style="margin-right: 1rem; font-size: 1.3rem;">'
35
+ f'{"βœ…" if completed else "<div class='spinner'></div>"}'
36
+ f'</span>'
37
+ f'<span class="step-text" style="font-size: 1.1rem; color: #e0e0e0;">{step}</span>'
38
+ f'</div>'
39
+ for step, completed in steps
40
+ ]
41
+ )
42
+
43
+ # status_description = '<p class="status-description" style="margin: 0.5rem 0 0 0; color: #c0c0c0; font-size: 1rem; font-weight: 400;">Processing steps below.</p>'
44
+ status_description = ''
45
+
46
+ if error_text:
47
+ error_html = f'<div class="error-message" style="color: #e53e3e; font-size: 1.2em;">{error_text}</div></div>'
48
+ else:
49
+ error_html = ''
50
+
51
+ return f'''
52
+ <div class="status-container" style="font-family: system-ui; max-width: 1472px; margin: 0 auto; background-color: #31395294; padding: 1rem; border-radius: 8px; color: #f0f0f0;">
53
+ <style>{spinner_css}</style>
54
+ <div class="status-header" style="background: #31395294; padding: 1rem; border-radius: 8px; font-weight: bold;">
55
+ <h3 class="status-title" style="margin: 0; color: rgb(224, 224, 224); font-size: 1.5rem; font-weight: 700;">Status: {status}</h3>
56
+ {status_description}
57
+ {error_html}
58
+ </div>
59
+ <div class="steps" style="margin-top: 1rem;">
60
+ {steps_html}
61
+ </div>
62
+ </div>
63
+ '''
64
+
65
+
66
+ def create_effect_span_prefix_postfix(effect_description: str):
67
+ """Create an HTML span with effect tooltip."""
68
+ # NOTE: it's important not to use multiline python string in order not to add whitespaces
69
+ prefix = (
70
+ '<span class="character-segment">'
71
+ '<span class="effect-container">'
72
+ '<span class="effect-text">'
73
+ )
74
+
75
+ postfix = (
76
+ '</span>'
77
+ f'<span class="effect-tooltip">Effect: {effect_description}</span>'
78
+ '</span>'
79
+ '</span>'
80
+ )
81
+
82
+ return prefix, postfix
83
+
84
+
85
+ def create_effect_span(text: str, effect_description: str) -> str:
86
+ prefix, postfix = create_effect_span_prefix_postfix(effect_description=effect_description)
87
+ res = f"{prefix}{text}{postfix}"
88
+ return res
89
+
90
+
91
+ def create_regular_span(text: str, bg_color: str) -> str:
92
+ """Create a regular HTML span with background color."""
93
+ return f'<span class="character-segment" style="background-color: {bg_color}">{text}</span>'
94
+
95
+
96
+ def _generate_legend_for_text_split_html(
97
+ character_phrases: list[CharacterPhrase], add_effect_legend: bool = False
98
+ ) -> str:
99
+ html = (
100
+ "<div style='margin-bottom: 1rem;'>"
101
+ "<div style='font-size: 1.35em; font-weight: bold;'>Legend:</div>"
102
+ )
103
+
104
+ unique_characters = set(phrase.character or 'Unassigned' for phrase in character_phrases)
105
+ characters_sorted = sorted(unique_characters, key=lambda c: c.lower())
106
+
107
+ for character in characters_sorted:
108
+ color = get_character_color(character)
109
+ html += f"<div style='color: {color}; font-size: 1.1em; margin-bottom: 0.25rem;'>{character}</div>"
110
+
111
+ if add_effect_legend:
112
+ html += (
113
+ '<div style="font-size: 1.1em; margin-bottom: 0.25rem;">'
114
+ '<span class="effect-text">🎡 #1</span>'
115
+ ' - sound effect start position (hover to see the prompt)'
116
+ '</div>'
117
+ )
118
+
119
+ html += "</div>"
120
+ return html
121
+
122
+
123
+ def _generate_text_split_html(
124
+ character_phrases: list[CharacterPhrase],
125
+ ) -> tuple[str, dict[int, int]]:
126
+ html_items = ["<div style='font-size: 1.2em; line-height: 1.6;'>"]
127
+
128
+ index_mapping = {} # Mapping from original index to HTML index
129
+ orig_index = 0 # Index in the original text
130
+ html_index = len(html_items[0]) # Index in the HTML output
131
+
132
+ for phrase in character_phrases:
133
+ character = phrase.character or 'Unassigned'
134
+ text = phrase.text
135
+ color = get_character_color(character)
136
+ rgba_color = f"rgba({hex_to_rgb(color)}, 0.5)"
137
+
138
+ prefix = f"<span style='background-color: {rgba_color}; border-radius: 0.2em;'>"
139
+ suffix = '</span>'
140
+
141
+ # Append the HTML for this phrase
142
+ html_items.append(f"{prefix}{text}{suffix}")
143
+
144
+ # Map each character index from the original text to the HTML text
145
+ html_index += len(prefix)
146
+ for i in range(len(text)):
147
+ index_mapping[orig_index + i] = html_index + i
148
+ # Update indices
149
+ orig_index += len(text)
150
+ html_index += len(text) + len(suffix)
151
+
152
+ html_items.append("</div>")
153
+
154
+ html = ''.join(html_items)
155
+ return html, index_mapping
156
+
157
+
158
+ def generate_text_split_inner_html_no_effect(character_phrases: list[CharacterPhrase]) -> str:
159
+ legend_html = _generate_legend_for_text_split_html(
160
+ character_phrases=character_phrases, add_effect_legend=False
161
+ )
162
+ text_split_html, char_ix_orig_2_html = _generate_text_split_html(
163
+ character_phrases=character_phrases
164
+ )
165
+ return legend_html + text_split_html
166
+
167
+
168
+ def generate_text_split_inner_html_with_effects(
169
+ character_phrases: list[CharacterPhrase],
170
+ sound_effects_descriptions: list[SoundEffectDescription],
171
+ ) -> str:
172
+ legend_html = _generate_legend_for_text_split_html(
173
+ character_phrases=character_phrases, add_effect_legend=True
174
+ )
175
+ text_split_html, char_ix_orig_2_html = _generate_text_split_html(
176
+ character_phrases=character_phrases
177
+ )
178
+
179
+ if not sound_effects_descriptions:
180
+ return legend_html + text_split_html
181
+
182
+ prev_end = 0
183
+ content_html_parts = []
184
+ for ix, sed in enumerate(sound_effects_descriptions, start=1):
185
+ # NOTE: 'sed' contains approximate indices from the original text.
186
+ # that's why we use safe conversion before accessing char mapping
187
+ ix_start = get_collection_safe_index(
188
+ ix=sed.ix_start_orig_text, collection=char_ix_orig_2_html
189
+ )
190
+ # ix_end = get_collection_safe_index(ix=sed.ix_end_orig_text, collection=char_ix_orig_2_html)
191
+
192
+ html_start_ix = char_ix_orig_2_html[ix_start]
193
+ # html_end_ix = char_ix_orig_2_html[ix_end] # NOTE: this is incorrect
194
+ # BUG: here we take exact same number of characters as in text between sound effect tags.
195
+ # This introduces the bug: HTML text could be included in 'text_under_effect',
196
+ # due to inaccuracies in 'sed' indices.
197
+ # html_end_ix = html_start_ix + ix_end - ix_start # NOTE: this is correct
198
+ # NOTE: reason is that html may exist between original text characters
199
+
200
+ prefix = text_split_html[prev_end:html_start_ix]
201
+ if prefix:
202
+ content_html_parts.append(prefix)
203
+
204
+ # text_under_effect = text_split_html[html_start_ix:html_end_ix]
205
+ text_under_effect = f'🎡 #{ix}'
206
+ if text_under_effect:
207
+ effect_prefix, effect_postfix = create_effect_span_prefix_postfix(
208
+ effect_description=sed.prompt
209
+ )
210
+ text_under_effect_wrapped = f'{effect_prefix}{text_under_effect}{effect_postfix}'
211
+ content_html_parts.append(text_under_effect_wrapped)
212
+
213
+ # prev_end = html_end_ix
214
+ prev_end = html_start_ix
215
+
216
+ last = text_split_html[prev_end:]
217
+ if last:
218
+ content_html_parts.append(last)
219
+
220
+ content_html = ''.join(content_html_parts)
221
+ content_html = f'{EFFECT_CSS}<div class="text-effect-container">{content_html}</div>'
222
+ html = legend_html + content_html
223
+ return html
224
+
225
+
226
+ def generate_voice_mapping_inner_html(select_voice_chain_out):
227
+ character2props = {}
228
+ html = AUDIO_PLAYER_CSS
229
+
230
+ for key in set(select_voice_chain_out.character2props) | set(
231
+ select_voice_chain_out.character2voice
232
+ ):
233
+ character_props = select_voice_chain_out.character2props.get(key, []).model_dump()
234
+ character_props["voice_id"] = select_voice_chain_out.character2voice.get(key, [])
235
+ character_props["sample_audio_url"] = get_audio_from_voice_id(character_props["voice_id"])
236
+
237
+ character2props[prettify_unknown_character_label(key)] = character_props
238
+
239
+ for character, voice_properties in sorted(character2props.items(), key=lambda x: x[0].lower()):
240
+ color = get_character_color(character)
241
+ audio_url = voice_properties.get('sample_audio_url', '')
242
+
243
+ html += f'''
244
+ <div class="voice-assignment">
245
+ <div class="voice-details">
246
+ <span class="character-name" style="color: {color};">{character}</span>
247
+ <span>β†’</span>
248
+ <span class="voice-props">
249
+ Gender: {voice_properties.get('gender', 'N/A')},
250
+ Age: {voice_properties.get('age_group', 'N/A')},
251
+ Voice ID: {voice_properties.get('voice_id', 'N/A')}
252
+ </span>
253
+ </div>
254
+ <div class="custom-audio-player">
255
+ <audio controls preload="none">
256
+ <source src="{audio_url}" type="audio/mpeg">
257
+ Your browser does not support the audio element.
258
+ </audio>
259
+ </div>
260
+ </div>
261
+ '''
262
+
263
+ return html
264
+
265
+
266
+ AUDIO_PLAYER_CSS = """\
267
+ <style>
268
+ .custom-audio-player {
269
+ display: inline-block;
270
+ width: 250px;
271
+ --bg-color: #ff79c6;
272
+ --highlight-color: #4299e100;
273
+ --text-color: #e0e0e0;
274
+ --border-radius: 0px;
275
+ }
276
+
277
+ .custom-audio-player audio {
278
+ width: 100%;
279
+ height: 36px;
280
+ border-radius: var(--border-radius);
281
+ background-color: #3f2a2a00;
282
+ outline: none;
283
+ }
284
+
285
+ .custom-audio-player audio::-webkit-media-controls-panel {
286
+ background-color: var(--bg-color);
287
+ }
288
+
289
+ .custom-audio-player audio::-webkit-media-controls-current-time-display,
290
+ .custom-audio-player audio::-webkit-media-controls-time-remaining-display {
291
+ color: var(--text-color);
292
+ }
293
+
294
+ .custom-audio-player audio::-webkit-media-controls-play-button {
295
+ background-color: var(--highlight-color);
296
+ border-radius: 50%;
297
+ height: 30px;
298
+ width: 30px;
299
+ }
300
+
301
+ .custom-audio-player audio::-webkit-media-controls-timeline {
302
+ background-color: var(--bg-color);
303
+ height: 6px;
304
+ border-radius: 3px;
305
+ }
306
+
307
+ /* Container styles for voice assignment display */
308
+ .voice-assignment {
309
+ background-color: rgba(49, 57, 82, 0.8);
310
+ padding: 1rem;
311
+ padding-left: 1rem;
312
+ padding-right: 1rem;
313
+ padding-top: 0.2rem;
314
+ padding-bottom: 0.2rem;
315
+ border-radius: var(--border-radius);
316
+ margin-top: 0.5rem;
317
+ color: var(--text-color);
318
+ display: flex;
319
+ align-items: center;
320
+ justify-content: space-between;
321
+ flex-wrap: wrap;
322
+ gap: 1rem;
323
+ border-radius: 7px;
324
+ }
325
+
326
+ .voice-assignment span {
327
+ font-weight: 600;
328
+ }
329
+
330
+ .voice-details {
331
+ display: flex;
332
+ align-items: center;
333
+ gap: 0.5rem;
334
+ }
335
+
336
+ .character-name {
337
+ color: var(--highlight-color);
338
+ font-weight: bold;
339
+ }
340
+
341
+ .voice-props {
342
+ color: #4a5568;
343
+ }
344
+ </style>
345
+ """
src/web/variables.py ADDED
@@ -0,0 +1,517 @@
1
+ from src.config import ELEVENLABS_API_KEY
2
+
3
+ DESCRIPTION_JS = """function createGradioAnimation() {
4
+ // Create main container
5
+ var container = document.createElement('div');
6
+ container.id = 'gradio-animation';
7
+ container.style.padding = '2rem';
8
+ container.style.background = 'transparent';
9
+ container.style.borderRadius = '12px';
10
+ container.style.margin = '0 0 2rem 0';
11
+ container.style.maxWidth = '100%';
12
+ container.style.transition = 'all 0.3s ease';
13
+
14
+ // Create header section
15
+ var header = document.createElement('div');
16
+ header.style.textAlign = 'center';
17
+ header.style.marginBottom = '2rem';
18
+ container.appendChild(header);
19
+
20
+ // Title with spaces
21
+ var titleText = 'AI Audio Books πŸ“•πŸ‘¨β€πŸ’»πŸŽ§';
22
+ var title = document.createElement('h1');
23
+ title.style.fontSize = '2.5rem';
24
+ title.style.fontWeight = '700';
25
+ title.style.color = '#f1f1f1';
26
+ title.style.marginBottom = '1.5rem';
27
+ title.style.opacity = '0'; // Start with opacity 0
28
+ title.style.transition = 'opacity 0.5s ease'; // Add transition
29
+ title.innerText = titleText;
30
+ header.appendChild(title);
31
+
32
+ // Add description
33
+ var description = document.createElement('p');
34
+ description.innerHTML = `
35
+ <div style="font-size: 1.1rem; color: #c0c0c0; margin-bottom: 2rem; line-height: 1.6;">
36
+ Create an audiobook from the input text automatically, using Gen-AI!<br>
37
+ All you need to do is input the book text or select one of the provided Sample Inputs.
38
+ </div>
39
+ `;
40
+ description.style.opacity = '0';
41
+ description.style.transition = 'opacity 0.5s ease';
42
+ header.appendChild(description);
43
+
44
+ // Create process section
45
+ var processSection = document.createElement('div');
46
+ processSection.style.backgroundColor = 'rgba(255, 255, 255, 0.05)';
47
+ processSection.style.padding = '1.5rem';
48
+ processSection.style.borderRadius = '8px';
49
+ processSection.style.marginTop = '1rem';
50
+ container.appendChild(processSection);
51
+
52
+ // Add "AI will do the rest:" header
53
+ var processHeader = document.createElement('div');
54
+ processHeader.style.fontSize = '1.2rem';
55
+ processHeader.style.fontWeight = '600';
56
+ processHeader.style.color = '#e0e0e0';
57
+ processHeader.style.marginBottom = '1rem';
58
+ processHeader.innerHTML = 'AI will do the rest:';
59
+ processHeader.style.opacity = '0';
60
+ processHeader.style.transition = 'opacity 0.5s ease';
61
+ processSection.appendChild(processHeader);
62
+
63
+ // Define steps with icons
64
+ var steps = [
65
+ { text: 'Split text into characters', icon: 'πŸ“š' },
66
+ { text: 'Select voice for each character', icon: '🎭' },
67
+ { text: 'Enhance text to convey emotions and intonations during Text-to-Speech', icon: '😊' },
68
+ { text: 'Generate audiobook using Text-to-Speech model', icon: '🎧' },
69
+ { text: 'Generate sound effects to create immersive atmosphere (optional)', icon: '🎡' },
70
+ { text: 'Clone your voice to generate the audiobook (optional)', icon: 'πŸ’₯' },
71
+ ];
72
+
73
+ // Create steps list
74
+ var stepsList = document.createElement('div');
75
+ stepsList.style.opacity = '0';
76
+ stepsList.style.transition = 'opacity 0.5s ease';
77
+ processSection.appendChild(stepsList);
78
+
79
+ steps.forEach(function(step, index) {
80
+ var stepElement = document.createElement('div');
81
+ stepElement.style.display = 'flex';
82
+ stepElement.style.alignItems = 'center';
83
+ stepElement.style.padding = '0.8rem';
84
+ stepElement.style.marginBottom = '0.5rem';
85
+ stepElement.style.backgroundColor = 'rgba(255, 255, 255, 0.03)';
86
+ stepElement.style.borderRadius = '6px';
87
+ stepElement.style.transform = 'translateX(-20px)';
88
+ stepElement.style.opacity = '0';
89
+ stepElement.style.transition = 'all 0.3s ease';
90
+
91
+ // Add hover effect
92
+ stepElement.onmouseover = function() {
93
+ this.style.backgroundColor = 'rgba(255, 255, 255, 0.07)';
94
+ };
95
+ stepElement.onmouseout = function() {
96
+ this.style.backgroundColor = 'rgba(255, 255, 255, 0.03)';
97
+ };
98
+
99
+ var icon = document.createElement('span');
100
+ icon.style.marginRight = '1rem';
101
+ icon.style.fontSize = '1.2rem';
102
+ icon.innerText = step.icon;
103
+ stepElement.appendChild(icon);
104
+
105
+ var text = document.createElement('span');
106
+ text.style.color = '#c0c0c0';
107
+ text.style.fontSize = '1rem';
108
+ text.innerText = step.text;
109
+ stepElement.appendChild(text);
110
+
111
+ stepsList.appendChild(stepElement);
112
+ });
113
+
114
+ // Insert into Gradio container
115
+ var gradioContainer = document.querySelector('.gradio-container');
116
+ gradioContainer.insertBefore(container, gradioContainer.firstChild);
117
+
118
+ // New timing for animations
119
+ setTimeout(function() {
120
+ title.style.opacity = '1';
121
+ }, 250);
122
+
123
+ // Show description after 1 second
124
+ setTimeout(function() {
125
+ description.style.opacity = '1';
126
+ processHeader.style.opacity = '1';
127
+ }, 700);
128
+
129
+ // Show steps after 2 seconds
130
+ setTimeout(function() {
131
+ stepsList.style.opacity = '1';
132
+ stepsList.querySelectorAll('div').forEach(function(step, index) {
133
+ setTimeout(function() {
134
+ step.style.transform = 'translateX(0)';
135
+ step.style.opacity = '1';
136
+ }, index * 100);
137
+ });
138
+ }, 1100);
139
+
140
+ async function playAudio(url) {
141
+ try {
142
+ const audio = new Audio(url);
143
+ await audio.play();
144
+ } catch (error) {
145
+ console.error('Error playing audio:', error);
146
+ }
147
+ }
148
+
149
+ // Add click handler to all audio links
150
+ document.addEventListener('click', function(e) {
151
+ if (e.target.classList.contains('audio-link')) {
152
+ e.preventDefault();
153
+ playAudio(e.target.getAttribute('data-audio-url'));
154
+ }
155
+ });
156
+
157
+ return 'Animation created';
158
+ }"""
159
+
160
+ STATUS_DISPLAY_HTML = '''
161
+ <style>
162
+ .status-container {
163
+ font-family: system-ui;
164
+ max-width: 1472px;
165
+ margin: 0 auto;
166
+ background-color: #31395294; /* Darker background color */
167
+ padding: 1rem;
168
+ border-radius: 8px;
169
+ color: #f0f0f0; /* Light text color */
170
+ }
171
+ .status-header {
172
+ background: #31395294; /* Slightly lighter background */
173
+ padding: 1rem;
174
+ border-radius: 8px;
175
+ font-weight: bold; /* Emphasize header */
176
+ }
177
+ .status-title {
178
+ margin: 0;
179
+ color: rgb(224, 224, 224); /* White color for title */
180
+ font-size: 1.5rem; /* Larger title font */
181
+ font-weight: 700; /* Bold title */
182
+ }
183
+ .status-description {
184
+ margin: 0.5rem 0 0 0;
185
+ color: #c0c0c0;
186
+ font-size: 1rem;
187
+ font-weight: 400; /* Regular weight for description */
188
+ }
189
+ .steps {
190
+ margin-top: 1rem;
191
+ }
192
+ .step-item {
193
+ display: flex;
194
+ align-items: center;
195
+ padding: 0.8rem;
196
+ margin-bottom: 0.5rem;
197
+ background-color: #31395294; /* Matching background color */
198
+ border-radius: 6px;
199
+ color: #f0f0f0; /* Light text color */
200
+ font-weight: 600; /* Medium weight for steps */
201
+ }
202
+ .step-item:hover {
203
+ background-color: rgba(255, 255, 255, 0.07);
204
+ }
205
+ .step-icon {
206
+ margin-right: 1rem;
207
+ font-size: 1.3rem; /* Slightly larger icon size */
208
+ }
209
+ .step-text {
210
+ font-size: 1.1rem; /* Larger text for step description */
211
+ color: #e0e0e0; /* Lighter text for better readability */
212
+ }
213
+ </style>
214
+
215
+ <div class="status-container">
216
+ <div class="status-header">
217
+ <h2 class="status-title">Status: Waiting to Start</h2>
218
+ <p class="status-description">Enter text or upload a file to begin.</p>
219
+ </div>
220
+ </div>
221
+ '''
222
+ GRADIO_THEME = "freddyaboulton/dracula_revamped"
223
+
224
+ VOICE_UPLOAD_JS = f"""
225
+ async function createVoiceUploadPopup() {{
226
+ try {{
227
+ let savedVoiceId = null;
228
+ const result = await new Promise((resolve, reject) => {{
229
+ // Create overlay with soft animation
230
+ const overlay = document.createElement('div');
231
+ Object.assign(overlay.style, {{
232
+ position: 'fixed',
233
+ top: '0',
234
+ left: '0',
235
+ width: '100%',
236
+ height: '100%',
237
+ backgroundColor: 'rgba(0, 0, 0, 0.8)',
238
+ display: 'flex',
239
+ justifyContent: 'center',
240
+ alignItems: 'center',
241
+ zIndex: '1000',
242
+ opacity: '0',
243
+ transition: 'opacity 0.3s ease-in-out'
244
+ }});
245
+
246
+ overlay.offsetHeight; // Trigger reflow for transition
247
+ overlay.style.opacity = '1';
248
+
249
+ // Create popup container with modern design
250
+ const popup = document.createElement('div');
251
+ Object.assign(popup.style, {{
252
+ backgroundColor: '#3b4c63',
253
+ padding: '30px',
254
+ borderRadius: '12px',
255
+ width: '450px',
256
+ maxWidth: '95%',
257
+ position: 'relative',
258
+ boxShadow: '0 10px 25px rgba(0, 0, 0, 0.3)',
259
+ transform: 'scale(0.9)',
260
+ transition: 'transform 0.3s ease-out',
261
+ display: 'flex',
262
+ flexDirection: 'column',
263
+ alignItems: 'center'
264
+ }});
265
+
266
+ popup.offsetHeight; // Trigger reflow
267
+ popup.style.transform = 'scale(1)';
268
+
269
+ // Create close button
270
+ const closeBtn = document.createElement('button');
271
+ Object.assign(closeBtn.style, {{
272
+ position: 'absolute',
273
+ right: '15px',
274
+ top: '15px',
275
+ border: 'none',
276
+ background: 'none',
277
+ fontSize: '24px',
278
+ cursor: 'pointer',
279
+ color: '#d3d3d3',
280
+ transition: 'color 0.2s ease'
281
+ }});
282
+ closeBtn.innerHTML = 'βœ•';
283
+ closeBtn.onmouseover = () => closeBtn.style.color = '#ffffff';
284
+ closeBtn.onmouseout = () => closeBtn.style.color = '#d3d3d3';
285
+
286
+ // Create content
287
+ const content = document.createElement('div');
288
+ content.innerHTML = `
289
+ <div style="text-align: center; margin-bottom: 25px;">
290
+ <h2 style="color: #ffffff; margin: 0; font-size: 22px;">Upload Voice Sample</h2>
291
+ <p style="color: #b0b0b0; margin-top: 10px; font-size: 14px;">
292
+ Select an audio file to create an audiobook with your unique voice.
293
+ </p>
294
+ </div>
295
+ <div style="margin-bottom: 20px; display: flex; flex-direction: column; align-items: center; width: 100%;">
296
+ <label for="voiceFile" style="
297
+ display: block;
298
+ margin-bottom: 10px;
299
+ color: #c0c0c0;
300
+ font-weight: 600;
301
+ text-align: center;">
302
+ Choose Audio File (MP3, WAV, OGG):
303
+ </label>
304
+ <input type="file" id="voiceFile" accept="audio/*"
305
+ style="
306
+ width: 100%;
307
+ padding: 12px;
308
+ border: 2px dashed #4a6f91;
309
+ border-radius: 8px;
310
+ background-color: #2a3a50;
311
+ color: #ffffff;
312
+ text-align: center;
313
+ transition: border-color 0.3s ease;
314
+ ">
315
+ </div>
316
+ <div id="uploadStatus" style="
317
+ margin-bottom: 15px;
318
+ text-align: center;
319
+ min-height: 25px;
320
+ color: #d3d3d3;">
321
+ </div>
322
+ <button id="uploadBtn" style="
323
+ background-color: #4a6f91;
324
+ color: #ffffff;
325
+ padding: 12px 20px;
326
+ border: none;
327
+ border-radius: 8px;
328
+ cursor: pointer;
329
+ width: 100%;
330
+ font-weight: 600;
331
+ transition: background-color 0.3s ease, transform 0.1s ease;
332
+ ">
333
+ Upload Voice
334
+ </button>
335
+ `;
336
+
337
+ // Add elements to DOM
338
+ popup.appendChild(closeBtn);
339
+ popup.appendChild(content);
340
+ overlay.appendChild(popup);
341
+ document.body.appendChild(overlay);
342
+
343
+ // Button effects
344
+ const uploadBtn = popup.querySelector('#uploadBtn');
345
+ uploadBtn.onmouseover = () => uploadBtn.style.backgroundColor = '#3b5c77';
346
+ uploadBtn.onmouseout = () => uploadBtn.style.backgroundColor = '#4a6f91';
347
+ uploadBtn.onmousedown = () => uploadBtn.style.transform = 'scale(0.98)';
348
+ uploadBtn.onmouseup = () => uploadBtn.style.transform = 'scale(1)';
349
+
350
+ // Handle close
351
+ const handleClose = () => {{
352
+ overlay.style.opacity = '0';
353
+ setTimeout(() => {{
354
+ overlay.remove();
355
+ resolve(savedVoiceId);
356
+ }}, 300);
357
+ }};
358
+
359
+ closeBtn.onclick = handleClose;
360
+ overlay.onclick = (e) => {{
361
+ if (e.target === overlay) {{
362
+ handleClose();
363
+ }}
364
+ }};
365
+
366
+ // Handle file upload
367
+ const statusDiv = popup.querySelector('#uploadStatus');
368
+ const fileInput = popup.querySelector('#voiceFile');
369
+
370
+ uploadBtn.onclick = async () => {{
371
+ const file = fileInput.files[0];
372
+ if (!file) {{
373
+ statusDiv.textContent = 'Please select a file first.';
374
+ statusDiv.style.color = '#e74c3c';
375
+ return;
376
+ }}
377
+
378
+ const API_KEY = "{ELEVENLABS_API_KEY}";
379
+
380
+ statusDiv.textContent = 'Uploading...';
381
+ statusDiv.style.color = '#4a6f91';
382
+ uploadBtn.disabled = true;
383
+ uploadBtn.style.backgroundColor = '#6c8091';
384
+
385
+ const formData = new FormData();
386
+ formData.append('files', file);
387
+ formData.append('name', `voice_${{Date.now()}}`);
388
+
389
+ try {{
390
+ const response = await fetch('https://api.elevenlabs.io/v1/voices/add', {{
391
+ method: 'POST',
392
+ headers: {{
393
+ 'Accept': 'application/json',
394
+ 'xi-api-key': API_KEY
395
+ }},
396
+ body: formData
397
+ }});
398
+
399
+ const result = await response.json();
400
+
401
+ if (response.ok) {{
402
+ savedVoiceId = result.voice_id
403
+ statusDiv.innerHTML = `
404
+ <div style="
405
+ background-color: #2e3e50;
406
+ color: #00b894;
407
+ padding: 10px;
408
+ border-radius: 6px;
409
+ font-weight: 600;
410
+ ">
411
+ Voice uploaded successfully!
412
+ <br>Your Voice ID: <span style="color: #0984e3;">${{result.voice_id}}</span>
413
+ </div>
414
+ `;
415
+
416
+ // Update the visible HTML panel
417
+ const voiceIdPanel = document.querySelector('#voice_id_panel');
418
+ if (voiceIdPanel) {{
419
+ voiceIdPanel.innerHTML = `<strong>Your voice_id from uploaded audio is </strong> <span style="color: #0984e3;">${{result.voice_id}}</span>`;
420
+ }}
421
+
422
+ setTimeout(() => {{
423
+ overlay.style.opacity = '0';
424
+ setTimeout(() => {{
425
+ overlay.remove();
426
+ resolve(result.voice_id); // Resolve with the voice ID
427
+ }}, 300);
428
+ }}, 3000);
429
+ }} else {{
430
+ throw new Error(result.detail?.message || 'Upload failed');
431
+ }}
432
+ }} catch (error) {{
433
+ statusDiv.innerHTML = `
434
+ <div style="
435
+ background-color: #3b4c63;
436
+ color: #d63031;
437
+ padding: 10px;
438
+ border-radius: 6px;
439
+ font-weight: 600;
440
+ ">
441
+ Error: ${{error.message}}
442
+ </div>
443
+ `;
444
+ uploadBtn.disabled = false;
445
+ uploadBtn.style.backgroundColor = '#4a6f91';
446
+ }}
447
+ }};
448
+ }});
449
+
450
+ return result; // Return the voice ID from the Promise
451
+ }} catch (error) {{
452
+ console.error('Error in createVoiceUploadPopup:', error);
453
+ return null;
454
+ }}
455
+ }}
456
+ """
457
+
458
+ EFFECT_CSS = """\
459
+ <style>
460
+ .text-effect-container {
461
+ line-height: 1.6;
462
+ }
463
+
464
+ .character-segment {
465
+ border-radius: 0.2em;
466
+ }
467
+
468
+ .effect-container {
469
+ position: relative;
470
+ display: inline-block;
471
+ }
472
+
473
+ .effect-text {
474
+ border-radius: 13px;
475
+ border: 2px solid rgba(251, 224, 5, 0.91);
476
+ cursor: help;
477
+ color: rgba(53, 53, 53, 0.97) !important;
478
+ background-color: #ffffffd9;
479
+ font-size: 0.9em;
480
+ padding-left: 0.3em !important;
481
+ padding-right: 0.3em !important;
482
+ }
483
+
484
+ .effect-tooltip {
485
+ visibility: hidden;
486
+ background-color: #333;
487
+ color: white;
488
+ text-align: center;
489
+ padding: 5px 10px;
490
+ border-radius: 6px;
491
+ position: absolute;
492
+ z-index: 1;
493
+ bottom: 125%;
494
+ left: 50%;
495
+ transform: translateX(-50%);
496
+ white-space: nowrap;
497
+ opacity: 0;
498
+ transition: opacity 0.3s;
499
+ }
500
+
501
+ .effect-tooltip::after {
502
+ content: "";
503
+ position: absolute;
504
+ top: 100%;
505
+ left: 50%;
506
+ margin-left: -5px;
507
+ border-width: 5px;
508
+ border-style: solid;
509
+ border-color: #333 transparent transparent transparent;
510
+ }
511
+
512
+ .effect-container:hover .effect-tooltip {
513
+ visibility: visible;
514
+ opacity: 1;
515
+ }
516
+ </style>
517
+ """