Added the "Whisper Segments Filter" option along with some configuration adjustments.
Browse files1. Added the Whisper Segments Filter option, which, when enabled, can effectively improve the whisper hallucination, especially for the large-v3 version of the whisper model.
2. Set the Word Timestamps option to enable by default.
3. The textarea for outputting Transcription and Segments now supports displaying a scrollbar.
---
## Whisper Filter options

**This is an experimental feature and may potentially filter out correct transcription results.**

When enabled, this filter can effectively reduce Whisper hallucinations, especially with the large-v3 version of the Whisper model.
Observations for transcriptions (illustrated in the sketch after this list):
1. duration: the segment length, computed as end minus start. A duration that is inversely proportional to the text length (very short audio yielding very long text) may indicate a hallucinated result.
2. segment_last: during VAD transcription, the last result of each segment has a certain probability of being a hallucination.
3. avg_logprob: the average log probability, ranging from logprob_threshold (default: -1) to 0; larger values are better. A value below -0.9 may indicate a poor result.
4. compression_ratio: the gzip compression ratio, ranging from 0 to compression_ratio_threshold (default: 2.4); a higher positive value is preferable. A value below 0.9 may indicate a suboptimal result.
5. no_speech_prob: the probability of the no-speech token (<|nospeech|>), ranging from 0 to no_speech_threshold (default: 0.6); a smaller positive value is preferable. A value above 0.1 may indicate a suboptimal result.
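For instance, these observations translate into per-segment checks on the fields Whisper returns for each segment. The following is a rough sketch, not part of the commit; the field names follow Whisper's standard per-segment output:

```python
# Rough sketch (not from the commit): flag a suspicious segment using the
# observations above. The field names follow Whisper's per-segment output.
def looks_hallucinated(segment: dict) -> bool:
    duration = segment["end"] - segment["start"]   # observation 1
    text_len = len(segment["text"])
    return (
        segment["avg_logprob"] < -0.9              # observation 3: poor confidence
        or segment["compression_ratio"] < 0.9      # observation 4: oddly incompressible text
        or segment["no_speech_prob"] > 0.1         # observation 5: likely non-speech
        or (duration < 1.5 and text_len > 5)       # observation 1: short audio, long text
    )

segment = {"start": 12.0, "end": 12.8, "text": "Thanks for watching!",
           "avg_logprob": -0.95, "compression_ratio": 0.85, "no_speech_prob": 0.42}
print(looks_hallucinated(segment))  # True: trips several heuristics at once
```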
Four sets of filtering conditions have been established, using the text length, the segment duration, and the avg_logprob, compression_ratio, and no_speech_prob values returned by Whisper. Within a filter, comma-separated condition groups are combined with AND, and `||` inside parentheses separates OR alternatives within a group (see the evaluator sketch after this list):
1. avg_logprob < -0.9
2. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.5
3. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.07, compression_ratio < 0.9
4. (durationLen < 1.5 || segment_last), compression_ratio < 0.9, no_speech_prob > 0.1
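The new `filterSegments` method in app.py implements the full logic; the condensed evaluator below is a sketch of the same semantics only, not the commit's code:

```python
# Condensed sketch (not the commit's code) of the filter-string semantics:
# comma-separated groups are ANDed; "||" inside parentheses ORs alternatives.
import operator

OPS = {"=": operator.eq, "==": operator.eq, ">": operator.gt,
       ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def matches(filter_str: str, seg: dict) -> bool:
    def value(key: str):
        if key == "durationLen":
            return seg["end"] - seg["start"]      # derived: segment duration
        if key == "textLen":
            return len(seg["text"])               # derived: text length
        return seg.get(key, False)                # e.g. avg_logprob, segment_last

    def cond(token: str) -> bool:
        parts = token.split()
        if parts == ["segment_last"]:             # bare flag set by the VAD pass
            return bool(value("segment_last"))
        key, sign, threshold = parts
        return OPS[sign](value(key), float(threshold))

    return all(                                   # AND across comma groups
        any(cond(alt) for alt in group.strip("() ").split("||"))  # OR inside a group
        for group in filter_str.split(",")
    )

seg = {"start": 0.0, "end": 1.0, "text": "Thank you.",
       "avg_logprob": -0.5, "no_speech_prob": 0.6, "segment_last": True}
print(matches("(durationLen < 1.5 || segment_last), textLen > 5, "
              "avg_logprob < -0.4, no_speech_prob > 0.5", seg))  # True -> filtered out
```

A segment matching any one of the four filters is removed from the results and logged to the new "Filtered segment items" output.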
## Changed files

- app.py +139 -19
- config.json5 +21 -1
- docs/options.md +18 -0
- src/config.py +8 -3
- src/vad.py +5 -1

### app.py

```diff
@@ -1,7 +1,7 @@
 from datetime import datetime
 import json
 import math
-from typing import Iterator, Union, List
+from typing import Iterator, Union, List, Dict, Any
 import argparse
 
 from io import StringIO
@@ -244,14 +244,38 @@ class WhisperTranscriber:
         microphoneData: str = decodeOptions.pop("microphoneData")
         task: str = decodeOptions.pop("task")
 
-        vad:
-        vadMergeWindow:
-        vadMaxMergeSize:
-        vadPadding:
-        vadPromptWindow:
-        vadInitialPromptMode:
-        self.vad_process_timeout:
+        vad: str = decodeOptions.pop("vad")
+        vadMergeWindow: float = decodeOptions.pop("vadMergeWindow")
+        vadMaxMergeSize: float = decodeOptions.pop("vadMaxMergeSize")
+        vadPadding: float = decodeOptions.pop("vadPadding", self.app_config.vad_padding)
+        vadPromptWindow: float = decodeOptions.pop("vadPromptWindow", self.app_config.vad_prompt_window)
+        vadInitialPromptMode: str = decodeOptions.pop("vadInitialPromptMode", self.app_config.vad_initial_prompt_mode)
+        self.vad_process_timeout: float = decodeOptions.pop("vadPocessTimeout", self.vad_process_timeout)
 
+        self.whisperSegmentsFilters: List[List] = []
+        inputFilter: bool = decodeOptions.pop("whisperSegmentsFilter", None)
+        inputFilters = []
+        for idx in range(0,len(self.app_config.whisper_segments_filters),1):
+            inputFilters.append(decodeOptions.pop(f"whisperSegmentsFilter{idx}", None))
+        inputFilters = filter(None, inputFilters)
+        if inputFilter:
+            for inputFilter in inputFilters:
+                self.whisperSegmentsFilters.append([])
+                self.whisperSegmentsFilters[-1].append(inputFilter)
+                for text in inputFilter.split(","):
+                    result = []
+                    subFilter = [text] if "||" not in text else [strFilter_ for strFilter_ in text.lstrip("(").rstrip(")").split("||") if strFilter_]
+                    for string in subFilter:
+                        conditions = [condition for condition in string.split(" ") if condition]
+                        if len(conditions) == 1 and conditions[0] == "segment_last":
+                            pass
+                        elif len(conditions) == 3:
+                            conditions[-1] = float(conditions[-1])
+                        else:
+                            continue
+                        result.append(conditions)
+                    self.whisperSegmentsFilters[-1].append(result)
+
         diarization: bool = decodeOptions.pop("diarization", False)
         diarization_speakers: int = decodeOptions.pop("diarization_speakers", 2)
         diarization_min_speakers: int = decodeOptions.pop("diarization_min_speakers", 1)
@@ -388,6 +412,10 @@ class WhisperTranscriber:
 
             # Transcribe
             result = self.transcribe_file(model, source.source_path, whisperLangCode, task, vadOptions, scaled_progress_listener, **decodeOptions)
+            filterLog = result.get("filterLog", None)
+            filterLogText = [gr.Text.update(visible=False)]
+            if filterLog:
+                filterLogText = [gr.Text.update(visible=True, value=filterLog)]
             if translationModel is not None and whisperLang is None and result["language"] is not None and len(result["language"]) > 0:
                 whisperLang = get_lang_from_whisper_code(result["language"])
                 translationModel.whisperLang = whisperLang
@@ -466,8 +494,8 @@ class WhisperTranscriber:
                     zip.write(download_file, arcname=zip_file_name)
 
                 download.insert(0, downloadAllPath)
-
-            return download, text, vtt
+
+            return [download, text, vtt] + filterLogText
 
         finally:
             # Cleanup source
@@ -481,10 +509,10 @@ class WhisperTranscriber:
                 print("Error deleting temporary source file: \n" + source.source_path + ", \n" + str(e))
 
         except ExceededMaximumDuration as e:
-            return [],
+            return [], "[ERROR]: Maximum remote video length is " + str(e.maxDuration) + "s, file was " + str(e.videoDuration) + "s", "[ERROR]", ""
         except Exception as e:
             print(traceback.format_exc())
-            return [],
+            return [], "Error occurred during transcribe: " + str(e), traceback.format_exc(), ""
 
 
     def transcribe_file(self, model: AbstractWhisperContainer, audio_path: str, languageCode: str, task: str = None,
@@ -549,7 +577,13 @@ class WhisperTranscriber:
         else:
             # Default VAD
             result = whisperCallable.invoke(audio_path, 0, None, None, progress_listener=progressListener)
-
+
+        if self.whisperSegmentsFilters:
+            querySegmentsResult, filterLog = self.filterSegments(result["segments"])
+            result["segments"] = querySegmentsResult
+            if filterLog:
+                result["filterLog"] = filterLog
+
         # Diarization
         if self.diarization and self.diarization_kwargs:
             print("Diarizing ", audio_path)
@@ -564,6 +598,68 @@ class WhisperTranscriber:
             result = self.diarization.mark_speakers(diarization_result, result)
 
         return result
+
+    def filterSegments(self, querySegments: List[Dict[str, Any]]):
+        try:
+            if not self.whisperSegmentsFilters: return
+
+            filterIdx = 0
+            filterLog = []
+            querySegmentsResult = querySegments.copy()
+            for idx in range(len(querySegmentsResult),0,-1):
+                currentID = idx - 1
+                querySegment = querySegmentsResult[currentID]
+                for segmentsFilter in self.whisperSegmentsFilters:
+                    isFilter: bool = True
+                    for idx, strFilter in enumerate(segmentsFilter):
+                        if not isFilter: break
+                        if idx == 0:
+                            filterCondition = strFilter
+                            continue
+
+                        isFilter = True
+                        for subFilter in strFilter:
+                            key: str = subFilter[0]
+
+                            if key == "segment_last":
+                                isFilter = querySegment.get(key, None)
+                                if isFilter: break
+                                continue
+
+                            sign: str = subFilter[1]
+                            threshold: float = subFilter[2]
+
+                            if key == "durationLen":
+                                value = querySegment["end"] - querySegment["start"]
+                            elif key == "textLen":
+                                value = len(querySegment["text"])
+                            else:
+                                value = querySegment[key]
+
+                            if sign == "=" or sign == "==":
+                                isFilter = value == threshold
+                            elif sign == ">":
+                                isFilter = value > threshold
+                            elif sign == ">=":
+                                isFilter = value >= threshold
+                            elif sign == "<":
+                                isFilter = value < threshold
+                            elif sign == "<=":
+                                isFilter = value <= threshold
+                            else: isFilter = False
+
+                            if isFilter: break
+                    if isFilter: break
+                if isFilter:
+                    filterIdx += 1
+                    filterLog.append(f"filter{filterIdx:03d} [{filterCondition}]:")
+                    filterLog.append(f"\t{querySegment}\n")
+                    del querySegmentsResult[currentID]
+
+            return querySegmentsResult, "\n".join(filterLog)
+        except Exception as e:
+            print(traceback.format_exc())
+            print("Error filter segments: " + str(e))
 
     def _create_progress_listener(self, progress: gr.Progress):
         if (progress is None):
@@ -874,6 +970,11 @@ def create_ui(app_config: ApplicationConfig):
        gr.Checkbox(label="Word Timestamps", value=app_config.word_timestamps, elem_id="word_timestamps"),
        gr.Checkbox(label="Word Timestamps - Highlight Words", value=app_config.highlight_words, elem_id="highlight_words"),
     }
+
+    common_segments_filter_inputs = lambda : {
+        gr.Checkbox(label="Whisper Segments Filter", value=app_config.whisper_segments_filter, elem_id="whisperSegmentsFilter") if idx == 0 else
+        gr.Text(label=f"Filter {idx}", value=filterStr, elem_id=f"whisperSegmentsFilter{idx}") for idx, filterStr in enumerate([""] + app_config.whisper_segments_filters)
+    }
 
     has_diarization_libs = Diarization.has_libraries()
 
@@ -889,15 +990,30 @@ def create_ui(app_config: ApplicationConfig):
     }
 
     common_output = lambda : [
-        gr.File(label="Download"),
-        gr.Text(label="Transcription", autoscroll=False),
-        gr.Text(label="Segments", autoscroll=False),
+        gr.File(label="Download", elem_id="outputDownload"),
+        gr.Text(label="Transcription", autoscroll=False, show_copy_button=True, interactive=True, elem_id="outputTranscription", elem_classes="scroll-show"),
+        gr.Text(label="Segments", autoscroll=False, show_copy_button=True, interactive=True, elem_id="outputSegments", elem_classes="scroll-show"),
+        gr.Text(label="Filtered segment items", autoscroll=False, visible=False, show_copy_button=True, interactive=True, elem_id="outputFiltered", elem_classes="scroll-show"),
     ]
+
+    css = """
+    .scroll-show textarea {
+        overflow-y: auto !important;
+    }
+    .scroll-show textarea::-webkit-scrollbar {
+        all: initial !important;
+        background: #f1f1f1 !important;
+    }
+    .scroll-show textarea::-webkit-scrollbar-thumb {
+        all: initial !important;
+        background: #a8a8a8 !important;
+    }
+    """
 
     is_queue_mode = app_config.queue_concurrency_count is not None and app_config.queue_concurrency_count > 0
 
     simpleInputDict = {}
-
+
     with gr.Blocks() as simpleTranscribe:
         simpleTranslateInput = gr.State(value="m2m100", elem_id = "translateInput")
         simpleSourceInput = gr.State(value="urlData", elem_id = "sourceInput")
@@ -939,6 +1055,8 @@ def create_ui(app_config: ApplicationConfig):
                 simpleInputDict.update(common_vad_inputs())
             with gr.Accordion("Word Timestamps options", open=False):
                 simpleInputDict.update(common_word_timestamps_inputs())
+            with gr.Accordion("Whisper Filter options", open=False):
+                simpleInputDict.update(common_segments_filter_inputs())
             with gr.Accordion("Diarization options", open=False):
                 simpleInputDict.update(common_diarization_inputs())
             with gr.Accordion("Translation options", open=False):
@@ -957,7 +1075,7 @@ def create_ui(app_config: ApplicationConfig):
             gr.Markdown(readmeMd)
 
         simpleInputDict.update({simpleTranslateInput, simpleSourceInput})
-        simpleSubmit.click(fn=ui.transcribe_webui_simple_progress if is_queue_mode else ui.transcribe_webui_simple,
+        simpleSubmit.click(fn=ui.transcribe_webui_simple_progress if is_queue_mode else ui.transcribe_webui_simple,
                            inputs=simpleInputDict, outputs=simpleOutput)
 
     fullInputDict = {}
@@ -1032,6 +1150,8 @@ def create_ui(app_config: ApplicationConfig):
                     gr.Number(label="Repetition Penalty", value=app_config.repetition_penalty, elem_id = "repetition_penalty"),
                     gr.Number(label="No Repeat Ngram Size", value=app_config.no_repeat_ngram_size, precision=0, elem_id = "no_repeat_ngram_size")
                 })
+            with gr.Accordion("Whisper Segments Filter options", open=False):
+                fullInputDict.update(common_segments_filter_inputs())
             with gr.Accordion("Diarization options", open=False):
                 fullInputDict.update(common_diarization_inputs())
             with gr.Accordion("Translation options", open=False):
@@ -1051,7 +1171,7 @@ def create_ui(app_config: ApplicationConfig):
         fullSubmit.click(fn=ui.transcribe_webui_full_progress if is_queue_mode else ui.transcribe_webui_full,
                          inputs=fullInputDict, outputs=fullOutput)
 
-    demo = gr.TabbedInterface([simpleTranscribe, fullTranscribe], tab_names=["Simple", "Full"])
+    demo = gr.TabbedInterface([simpleTranscribe, fullTranscribe], tab_names=["Simple", "Full"], css=css)
 
     # Queue up the demo
     if is_queue_mode:
```

### config.json5

```diff
@@ -317,9 +317,13 @@
     "logprob_threshold": -1.0,
     // If the probability of the <no-speech> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence
     "no_speech_threshold": 0.6,
+    // [faster-whisper] The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.
+    "repetition_penalty": 1.0,
+    // [faster-whisper] The model ensures that a sequence of words of no_repeat_ngram_size isn’t repeated in the output sequence. If specified, it must be a positive integer greater than 1.
+    "no_repeat_ngram_size": 0,
 
     // (experimental) extract word-level timestamps and refine the results based on them
-    "word_timestamps":
+    "word_timestamps": true,
     // if word_timestamps is True, merge these punctuation symbols with the next word
     "prepend_punctuations": "\"\'“¿([{-",
     // if word_timestamps is True, merge these punctuation symbols with the previous word
@@ -339,4 +343,20 @@
     "diarization_max_speakers": 8,
     // The number of seconds before inactivate processes are terminated. Use 0 to close processes immediately, or None for no timeout.
     "diarization_process_timeout": 60,
+
+    // Whisper Segments Filter
+    "whisper_segments_filter": false,
+    "whisper_segments_filters": [
+        "avg_logprob < -0.9",
+        "(durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.5",
+        "(durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.07, compression_ratio < 0.9",
+        "(durationLen < 1.5 || segment_last), compression_ratio < 0.9, no_speech_prob > 0.1"
+    ],
+
+    // Translation - The maximum batch size.
+    "translation_batch_size": 2,
+    // Translation - Prevent repetitions of ngrams with this size (set 0 to disable).
+    "translation_no_repeat_ngram_size": 3,
+    // Translation - Beam size (1 for greedy search).
+    "translation_num_beams": 2,
 }
```

### docs/options.md

```diff
@@ -166,6 +166,24 @@ Penalty applied to the score of previously generated tokens (set > 1 to penalize
 This parameter only takes effect in [faster-whisper (ctranslate2)](https://github.com/SYSTRAN/faster-whisper/issues/478).
 Prevent repetitions of ngrams with this size (set 0 to disable).
 
+## Whisper Filter options
+**This is an experimental feature and may potentially filter out correct transcription results.**
+
+when enabled, can effectively improve the whisper hallucination, especially for the large-v3 version of the whisper model.
+
+Observations for transcriptions:
+1. duration: calculated by subtracting start from end, it might indicate hallucinated results when inversely proportional to text length.
+1. segment_last: the last result for each segment during VAD transcription has a certain probability of being a hallucinated result.
+1. avg_logprob: average log probability, ranging from logprob_threshold (default: -1) to 0, is better when a larger value. A value lower than -0.9 might suggest a poor result.
+1. compression_ratio: gzip compression ratio, ranging from 0 to compression_ratio_threshold (default: 2.4), a higher positive value is preferable. If it is lower than 0.9, it might indicate suboptimal results.
+1. no_speech_prob: no_speech(<|nospeech|> token) probability, ranging from 0 to no_speech_threshold (default: 0.6), a smaller positive value is preferable. If it exceeds 0.1, it might suggest suboptimal results.
+
+Four sets of filtering conditions have now been established, utilizing text length, duration length, as well as the avg_logprob, compression_ratio, and no_speech_prob parameters returned by Whisper.
+1. avg_logprob < -0.9
+1. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.5
+1. (durationLen < 1.5 || segment_last), textLen > 5, avg_logprob < -0.4, no_speech_prob > 0.07, compression_ratio < 0.9
+1. (durationLen < 1.5 || segment_last), compression_ratio < 0.9, no_speech_prob > 0.1
+
 ## Translation - Batch Size
 - transformers: batch_size
 When the pipeline will use DataLoader (when passing a dataset, on GPU for a Pytorch model), the size of the batch to use, for inference this is not always beneficial.
```

### src/config.py

```diff
@@ -54,7 +54,7 @@ class ApplicationConfig:
                  input_audio_max_duration: int = 600, share: bool = False, server_name: str = None, server_port: int = 7860,
                  queue_concurrency_count: int = 1, delete_uploaded_files: bool = True,
                  whisper_implementation: str = "whisper", default_model_name: str = "medium",
-
+                 default_vad: str = "silero-vad",
                  vad_parallel_devices: str = "", vad_cpu_cores: int = 1, vad_process_timeout: int = 1800,
                  auto_parallel: bool = False, output_dir: str = None,
                  model_dir: str = None, device: str = None,
@@ -71,7 +71,7 @@ class ApplicationConfig:
                  logprob_threshold: float = -1.0, no_speech_threshold: float = 0.6,
                  repetition_penalty: float = 1.0, no_repeat_ngram_size: int = 0,
                  # Word timestamp settings
-                 word_timestamps: bool =
+                 word_timestamps: bool = True, prepend_punctuations: str = "\"\'“¿([{-",
                  append_punctuations: str = "\"\'.。,,!!??::”)]}、",
                  highlight_words: bool = False,
                  # Diarization
@@ -82,6 +82,9 @@ class ApplicationConfig:
                  translation_batch_size: int = 2,
                  translation_no_repeat_ngram_size: int = 3,
                  translation_num_beams: int = 2,
+                 # Whisper Segments Filter
+                 whisper_segments_filter: bool = False,
+                 whisper_segments_filters: List[str] = [],
                  ):
 
         self.models = models
@@ -96,7 +99,6 @@ class ApplicationConfig:
 
         self.whisper_implementation = whisper_implementation
         self.default_model_name = default_model_name
-        self.default_nllb_model_name = default_nllb_model_name
         self.default_vad = default_vad
         self.vad_parallel_devices = vad_parallel_devices
         self.vad_cpu_cores = vad_cpu_cores
@@ -148,6 +150,9 @@ class ApplicationConfig:
         self.translation_batch_size = translation_batch_size
         self.translation_no_repeat_ngram_size = translation_no_repeat_ngram_size
         self.translation_num_beams = translation_num_beams
+        # Whisper Segments Filter
+        self.whisper_segments_filter = whisper_segments_filter
+        self.whisper_segments_filters = whisper_segments_filters
 
     def get_model_names(self, name: str):
         return [ x.name for x in self.models[name] ]
```

### src/vad.py

```diff
@@ -219,7 +219,11 @@ class AbstractTranscription(ABC):
             perf_end_time = time.perf_counter()
             print("\tWhisper took {} seconds".format(perf_end_time - perf_start_time))
 
-            adjusted_segments = self.adjust_timestamp(segment_result["segments"], adjust_seconds=segment_start, max_source_time=segment_duration)
+            adjusted_segments: List[Dict[str, Any]] = self.adjust_timestamp(segment_result["segments"], adjust_seconds=segment_start, max_source_time=segment_duration)
+
+            if len(adjusted_segments) > 0:
+                adjusted_segments[0]["segment_first"] = True
+                adjusted_segments[-1]["segment_last"] = True
 
             # Propagate expand amount to the segments
             if (segment_expand_amount > 0):
```