Upload 120 files
This view is limited to 50 files because it contains too many changes.
- .gitattributes +36 -35
- .gitignore +2 -0
- Dockerfile +23 -0
- README.md +11 -1
- app.py +212 -0
- bot.py +148 -0
- requirements.txt +3 -0
- saver.py +12 -0
- style.css +6 -0
- temp/gradio/0f3530471c34c76f9ae3d4371e3d7ee514337803d3b005765f4b841429f39ae0/c4_5k_sentences.txt +3 -0
- temp/gradio/1eed12f805ac5fc69765cf073c0789f2fa0beb45a2889954fea80e3502712fc7/gutenberg_27045.txt +3 -0
- temp/gradio/2eccd11097e5d097d30fbc0bb75385de2026208e447113380e02d9d664265e62/ligurian_2.mp3 +0 -0
- temp/gradio/7f3cc334a792175a240a628a2882c1a78e68a77954fbde6a2eaa56045701acac/english.mp3 +0 -0
- temp/gradio/87da63f0aabe1f9ef4833ae4548c4c073cbb5efc691051e498b29c51b5e547ca/ligurian_1.mp3 +0 -0
- temp/gradio/9434d7872cbe7d4964ebf21282f60e80d5b0724e46ae4f4244607397cf3f2954/zenamt_10k_sentences.txt +3 -0
- temp/gradio/9f0eda2f99e1153e171d6bf42aa23964e84c9324b01443f7d50c9f809e915107/ngen_he_whistles_often.wav +0 -0
- temp/gradio/a41ab02f4a892bb3192c2659932fda1ac6b2d4a4f132906fbd93b664e8175ea5/zenamt_5k_sentences.txt +3 -0
- temp/gradio/c96725fec391b6a10de94bfab9f87e2ff33d2c4e39353cd2bc0e127bb23f0f0f/Ngen_lexicon_16112024.txt +3 -0
- temp/gradio/e19724c317effb656ec46271bcf29a7643dab34b063649fa743a50ed2b0febb6/ligurian_3.mp3 +0 -0
- temp/gradio/eeb33b271d6ae4dd706c5d41e191181798b08290b7192db7ad689c7e898ed442/c4_10k_sentences.txt +3 -0
- temp/gradio/tmpq09r5cf8 +0 -0
- upload/english/c4_10k_sentences.txt +3 -0
- upload/english/c4_25k_sentences.txt +3 -0
- upload/english/c4_5k_sentences.txt +3 -0
- upload/english/cv8_top10k_words.txt +3 -0
- upload/english/english.mp3 +0 -0
- upload/english/english_full.mp3 +0 -0
- upload/english/gutenberg_27045.txt +3 -0
- upload/english/ngen_lexicon_jan_2025.txt +3 -0
- upload/ligurian/1.m4a +0 -0
- upload/ligurian/2.m4a +0 -0
- upload/ligurian/3.m4a +0 -0
- upload/ligurian/ligurian_1.mp3 +0 -0
- upload/ligurian/ligurian_1.txt +3 -0
- upload/ligurian/ligurian_2.mp3 +0 -0
- upload/ligurian/ligurian_2.txt +3 -0
- upload/ligurian/ligurian_3.mp3 +0 -0
- upload/ligurian/ligurian_3.txt +3 -0
- upload/ligurian/zenamt_10k_sentences.txt +3 -0
- upload/ligurian/zenamt_5k_sentences.txt +3 -0
- upload/mms_zs/config.json +108 -0
- upload/mms_zs/model.safetensors +3 -0
- upload/mms_zs/preprocessor_config.json +10 -0
- upload/mms_zs/special_tokens_map.json +6 -0
- upload/mms_zs/tokenizer_config.json +48 -0
- upload/mms_zs/tokens.txt +3 -0
- upload/mms_zs/vocab.json +34 -0
- uroman/.gitignore +35 -0
- uroman/LICENSE.txt +3 -0
- uroman/README.md +165 -0
.gitattributes
CHANGED
@@ -1,35 +1,36 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.txt filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1,2 @@
+__pycache__
+.env
Dockerfile
ADDED
@@ -0,0 +1,23 @@
+FROM python:3.12-slim
+
+# Install the required system dependencies
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    libsndfile1 \
+    perl \
+    && rm -rf /var/lib/apt/lists/*
+
+# Create the working directory
+WORKDIR /app
+
+# Copy requirements.txt
+COPY requirements.txt .
+
+# Install the Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the remaining project files
+COPY . .
+
+# Start the bot
+CMD ["python", "bot.py"]
README.md
CHANGED
@@ -1,3 +1,13 @@
 ---
-
+title: Mms Zeroshot
+emoji: 🌍
+colorFrom: red
+colorTo: gray
+sdk: gradio
+sdk_version: 4.36.1
+app_file: app.py
+pinned: false
+license: cc-by-nc-4.0
 ---
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py
ADDED
@@ -0,0 +1,212 @@
+import gradio as gr
+from zeroshot import (
+    process,
+    WORD_SCORE_DEFAULT_IF_NOLM,
+)
+import os
+import logging
+from pathlib import Path
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+# Set specific directory path
+TEMP_DIR = Path("D:/Ngen/bot/temp_lexicon")
+
+def ensure_temp_dir():
+    """Create and ensure access to the specific directory."""
+    try:
+        # Create temp directory if it doesn't exist
+        TEMP_DIR.mkdir(parents=True, exist_ok=True)
+        logger.info(f"Created or verified temp directory at {TEMP_DIR}")
+
+        # Test write permissions
+        test_file = TEMP_DIR / 'test_write'
+        try:
+            test_file.touch()
+            test_file.unlink()  # Remove test file
+            logger.info("Successfully verified write permissions")
+        except Exception as e:
+            logger.error(f"Failed to write to directory {TEMP_DIR}: {e}")
+            raise
+
+        return str(TEMP_DIR)
+    except Exception as e:
+        logger.error(f"Failed to create or access directory {TEMP_DIR}: {e}")
+        raise
+
+# Create temporary directory at startup
+TEMP_PATH = ensure_temp_dir()
+os.environ['TEMP_LEXICON_DIR'] = TEMP_PATH
+logger.info(f"Set TEMP_LEXICON_DIR environment variable to {TEMP_PATH}")
+
+def process_wrapper(audio, words_file, wscore, wscore_usedefault, reference):
+    """Wrapper around process with fixed parameters."""
+    generator = process(
+        audio_data=audio,
+        words_file=words_file,
+        lm_path=None,
+        wscore=wscore,
+        lmscore=None,
+        wscore_usedefault=wscore_usedefault,
+        lmscore_usedefault=True,
+        autolm=False,
+        reference=reference
+    )
+
+    # Accumulate the transcription and log chunks yielded by the generator
+    transcription = ""
+    logs = ""
+    for trans, log in generator:
+        transcription += trans
+        logs += log
+
+    return transcription, logs
+
+def create_gradio_interface():
+    """Create and configure the Gradio interface"""
+    with gr.Blocks(css="style.css") as demo:
+        gr.Markdown(
+            "<p align='center' style='font-size: 20px;'>MMS Zero-shot ASR Demo</p>"
+        )
+        gr.HTML(
+            """<center>The demo works on input audio in any language, as long as you provide a list of words or sentences for that language.<br>We recommend having a minimum of 10000 sentences in the textfile to achieve a good performance.</center>"""
+        )
+
+        with gr.Row():
+            with gr.Column():
+                # Audio input section
+                audio = gr.Audio(
+                    label="Audio Input\n(use microphone or upload a file)",
+                    type="filepath"
+                )
+
+                with gr.Row():
+                    words_file = gr.File(label="Text Data")
+
+                # Advanced settings section
+                with gr.Accordion("Advanced Settings", open=False):
+                    gr.Markdown(
+                        "The following parameters are used for beam-search decoding. Use the default values if you are not sure."
+                    )
+                    with gr.Row():
+                        with gr.Column():
+                            wscore_usedefault = gr.Checkbox(
+                                label="Use Default Word Insertion Score",
+                                value=True
+                            )
+                            wscore = gr.Slider(
+                                minimum=-10.0,
+                                maximum=10.0,
+                                value=WORD_SCORE_DEFAULT_IF_NOLM,
+                                step=0.1,
+                                interactive=False,
+                                label="Word Insertion Score",
+                            )
+
+                btn = gr.Button("Submit", elem_id="submit")
+
+                # Slider update function
+                @gr.on(
+                    inputs=[wscore_usedefault],
+                    outputs=[wscore],
+                )
+                def update_slider(ws):
+                    return gr.Slider(
+                        minimum=-10.0,
+                        maximum=10.0,
+                        value=WORD_SCORE_DEFAULT_IF_NOLM,
+                        step=0.1,
+                        interactive=not ws,
+                        label="Word Insertion Score",
+                    )
+
+            # Output section
+            with gr.Column():
+                text = gr.Textbox(label="Transcript")
+                with gr.Accordion("Logs", open=False):
+                    logs = gr.Textbox(show_label=False)
+
+        reference = gr.Textbox(label="Reference Transcript", visible=False)
+
+        # Process button click
+        btn.click(
+            fn=process_wrapper,
+            inputs=[
+                audio,
+                words_file,
+                wscore,
+                wscore_usedefault,
+                reference,
+            ],
+            outputs=[text, logs],
+        )
+
+        # Example inputs
+        gr.Examples(
+            examples=[
+                [
+                    "upload/english/english.mp3",
+                    "upload/english/c4_10k_sentences.txt",
+                    "This is going to look at the code that we have in our configuration that we've already exported and compare it to our database, and we want to import",
+                ],
+                [
+                    "upload/english/english.mp3",
+                    "upload/english/c4_5k_sentences.txt",
+                    "This is going to look at the code that we have in our configuration that we've already exported and compare it to our database, and we want to import",
+                ],
+                [
+                    "upload/english/english.mp3",
+                    "upload/english/gutenberg_27045.txt",
+                    "This is going to look at the code that we have in our configuration that we've already exported and compare it to our database, and we want to import",
+                ],
+            ],
+            inputs=[audio, words_file, reference],
+            label="English",
+        )
+
+        gr.Examples(
+            examples=[
+                [
+                    "upload/ligurian/ligurian_1.mp3",
+                    "upload/ligurian/zenamt_10k_sentences.txt",
+                    "I mæ colleghi m'an domandou d'aggiuttâli à fâ unna preuva co-o zeneise pe vedde s'o fonçioña.",
+                ],
+                [
+                    "upload/ligurian/ligurian_2.mp3",
+                    "upload/ligurian/zenamt_10k_sentences.txt",
+                    "Staseia vaggo à çenâ con mæ moggê e doî amixi che de chì à quarche settemaña faian stramuo feua stato.",
+                ],
+                [
+                    "upload/ligurian/ligurian_3.mp3",
+                    "upload/ligurian/zenamt_5k_sentences.txt",
+                    "Pe inandiâ o pesto ghe veu o baxaicò, i pigneu, l'euio, o formaggio, l'aggio e a sâ.",
+                ],
+            ],
+            inputs=[audio, words_file, reference],
+            label="Ligurian",
+        )
+
+    return demo
+
+def main():
+    try:
+        # Create and launch Gradio interface
+        demo = create_gradio_interface()
+
+        # Launch with specific host and port
+        demo.launch(
+            server_name='0.0.0.0',
+            server_port=7860,
+            show_error=True
+        )
+    except Exception as e:
+        logger.error(f"Failed to launch Gradio interface: {e}")
+        raise
+
+if __name__ == "__main__":
+    main()
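Note on `process_wrapper` in app.py: its loop concatenates every `(trans, log)` pair that `process` yields. `zeroshot.py` is not shown in this diff, so whether `process` yields incremental chunks or cumulative snapshots is an assumption either way. If it yields cumulative snapshots, concatenation would repeat text, and keeping only the final pair would be the safer choice. A minimal sketch of that variant, using a hypothetical stand-in generator `fake_process` and a hypothetical helper `last_result`:

```python
from typing import Iterable, Iterator, Tuple


def fake_process() -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for zeroshot.process, assumed to yield cumulative snapshots."""
    yield "", "step 1\n"
    yield "", "step 1\nstep 2\n"
    yield "hello world", "step 1\nstep 2\ndone\n"


def last_result(generator: Iterable[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the final (transcription, logs) pair, or two empty strings if nothing was yielded."""
    transcription, logs = "", ""
    for transcription, logs in generator:
        pass  # each iteration overwrites the previous snapshot
    return transcription, logs


if __name__ == "__main__":
    print(last_result(fake_process()))
```

If `process` turns out to yield incremental chunks instead, the concatenating loop in `process_wrapper` is the right one and this variant would drop text.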
bot.py
ADDED
@@ -0,0 +1,148 @@
+import telebot
+import tempfile
+import time
+import os
+from pathlib import Path
+import logging
+import soundfile as sf
+import librosa
+from zeroshot import WORD_SCORE_DEFAULT_IF_NOLM
+from app import process_wrapper
+import gradio as gr
+from dotenv import load_dotenv
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+load_dotenv()
+
+# Configuration
+BOT_TOKEN = os.getenv('BOT_TOKEN')
+if not BOT_TOKEN:
+    raise ValueError("No BOT_TOKEN found in environment variables")
+
+WORDS_FILE_PATH = 'upload/english/ngen_lexicon_jan_2025.txt'
+
+# Initialize the bot
+bot = telebot.TeleBot(BOT_TOKEN)
+
+def convert_audio(input_path):
+    """Convert audio to WAV format at 16 kHz."""
+    try:
+        # Load the audio and resample it to 16 kHz
+        y, sr = librosa.load(input_path, sr=16000)
+
+        # Create a temporary WAV file
+        output_path = input_path.replace('.ogg', '.wav')
+        sf.write(output_path, y, sr, format='WAV')
+
+        return output_path
+    except Exception as e:
+        logger.error(f"Error converting audio: {e}")
+        raise
+
+@bot.message_handler(commands=['start'])
+def send_welcome(message):
+    welcome_text = (
+        "👋 Hi! I'm an automatic speech recognition bot.\n\n"
+        "🎤 Send me a voice message or an audio file in English.\n\n"
+        "ℹ️ Any file format that Telegram can handle as audio is supported."
+    )
+    bot.reply_to(message, welcome_text)
+
+@bot.message_handler(content_types=['audio', 'voice'])
+def handle_audio(message):
+    try:
+        # Tell the user that processing has started
+        processing_msg = bot.reply_to(message, "🔄 Processing audio...")
+
+        # Get the file info
+        if message.voice:
+            file_info = bot.get_file(message.voice.file_id)
+        else:
+            file_info = bot.get_file(message.audio.file_id)
+
+        logger.info(f"Processing file: {file_info.file_path}")
+
+        # Download the file
+        downloaded_file = bot.download_file(file_info.file_path)
+
+        # Create a temporary file
+        with tempfile.NamedTemporaryFile(delete=False, suffix='.ogg') as temp_audio:
+            temp_audio.write(downloaded_file)
+            temp_audio_path = temp_audio.name
+
+        # Convert to WAV
+        wav_path = convert_audio(temp_audio_path)
+        logger.info(f"Converted to WAV: {wav_path}")
+
+
+        # Call process_wrapper
+        transcription, logs = process_wrapper(
+            audio=wav_path,
+            words_file=WORDS_FILE_PATH,
+            wscore=WORD_SCORE_DEFAULT_IF_NOLM,
+            wscore_usedefault=True,
+            reference=None
+        )
+
+        logger.info("Transcription done!")
+
+        logger.info(f"tr:{transcription}, log:{logs}!")
+
+        # Delete the temporary files
+        os.unlink(temp_audio_path)
+        os.unlink(wav_path)
+        # Send the result
+        if transcription:
+            bot.edit_message_text(
+                f"📝 Recognized text:\n\n{transcription}\n\n📊 Logs:\n{logs}",
+                chat_id=processing_msg.chat.id,
+                message_id=processing_msg.message_id
+            )
+        else:
+            raise ValueError("Empty transcription")
+
+    except Exception as e:
+        logger.error(f"Error processing audio: {e}")
+        error_message = (
+            "❌ An error occurred while processing the audio.\n\n"
+            "Please make sure that:\n"
+            "- The audio contains clear English speech\n"
+            "- The duration does not exceed the allowed limit\n"
+            f"- Error: {str(e)}"
+        )
+        # If a progress message was sent, edit it
+        if 'processing_msg' in locals():
+            bot.edit_message_text(
+                error_message,
+                chat_id=processing_msg.chat.id,
+                message_id=processing_msg.message_id
+            )
+        else:
+            bot.reply_to(message, error_message)
+
+def run_bot():
+    """Run the bot with error handling and retries."""
+    while True:
+        try:
+            logger.info("Starting the bot...")
+            bot.polling(none_stop=True, timeout=60, long_polling_timeout=30)
+            logger.info("bot ended")
+            return
+        except (ConnectionError, ConnectionResetError, ConnectionAbortedError) as e:
+            logger.error(f"Connection error occurred: {e}")
+            logger.info("Waiting 15 seconds before reconnecting...")
+            time.sleep(15)
+        except Exception as e:
+            logger.error(f"Critical error: {e}")
+            logger.info("Waiting 15 seconds before restarting...")
+            time.sleep(15)
+
+
+if __name__ == "__main__":
+    run_bot()
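Note on `handle_audio` in bot.py: the two `os.unlink` calls run only when conversion and transcription succeed, so a failure in either step leaves the temporary files behind. A minimal sketch of one way to guarantee cleanup, assuming a hypothetical `transcribe` callable in place of the `convert_audio` and `process_wrapper` calls (the helper name `with_temp_audio` is also illustrative). It uses `Path.with_suffix`, which changes only the final extension, whereas `str.replace('.ogg', '.wav')` would also rewrite any `.ogg` occurring earlier in the path:

```python
import tempfile
from pathlib import Path
from typing import Callable


def with_temp_audio(data: bytes, transcribe: Callable[[Path, Path], str]) -> str:
    """Write `data` to a temporary .ogg file, call `transcribe`, and always clean up.

    `transcribe(ogg_path, wav_path)` is a hypothetical callable, not part of bot.py.
    It may create `wav_path`; both files are removed afterwards.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".ogg") as tmp:
        tmp.write(data)
        ogg_path = Path(tmp.name)
    wav_path = ogg_path.with_suffix(".wav")  # changes only the final extension
    try:
        return transcribe(ogg_path, wav_path)
    finally:
        for path in (ogg_path, wav_path):
            path.unlink(missing_ok=True)  # ignore files that were never created
```

The `finally` block runs whether `transcribe` returns or raises, so the error path in the handler would no longer need its own cleanup.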
requirements.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:38fe967706983887c10bb422f6edbc2ca90860f5dddeed2bfc33d44c6e5076b1
+size 267
saver.py
ADDED
@@ -0,0 +1,12 @@
+class SaverCSV:
+    def __init__(self):
+        pass
+
+    '''
+    Saves the file permanently and stores its transcription
+    '''
+    def save_with_transcribe(self, audio_data, transcribe_data):
+        # save file (local or in IPFS)
+
+        # add transcription + link to file (for example file path or file hash for 16000 rate)
+        pass
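`SaverCSV` in saver.py is a stub. A minimal sketch of what its comments describe, assuming local storage only (no IPFS), a SHA-256 content hash as the file link, and a hypothetical `index.csv` with `sha256,path,transcription` columns. The `root` parameter, the `.wav` extension and the `bytes`/`str` argument types are likewise assumptions; none of these choices come from the commit:

```python
import csv
import hashlib
from pathlib import Path


class SaverCSV:
    """Store audio files locally and append their transcriptions to a CSV index."""

    def __init__(self, root: str = "saved_audio"):
        # `root` and the index file name are illustrative assumptions.
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index = self.root / "index.csv"

    def save_with_transcribe(self, audio_data: bytes, transcribe_data: str) -> Path:
        # Name the file after its content hash so identical audio is stored once.
        digest = hashlib.sha256(audio_data).hexdigest()
        audio_path = self.root / f"{digest}.wav"
        if not audio_path.exists():
            audio_path.write_bytes(audio_data)

        # Append one row per transcription; write the header only for a new index.
        is_new = not self.index.exists()
        with self.index.open("a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            if is_new:
                writer.writerow(["sha256", "path", "transcription"])
            writer.writerow([digest, str(audio_path), transcribe_data])
        return audio_path
```

Keying files by content hash keeps the index row stable even if the storage directory is moved, which is one reading of the "file hash" hint in the stub's comment.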
style.css
ADDED
@@ -0,0 +1,6 @@
+#submit {
+    margin: auto;
+    color: #fff;
+    background: #1565c0;
+    border-radius: 100vh;
+}
temp/gradio/0f3530471c34c76f9ae3d4371e3d7ee514337803d3b005765f4b841429f39ae0/c4_5k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8d869b186c5185702420066707c4b05665eacf0b85fe6ac2d6723973f23e21d0
+size 10581246
temp/gradio/1eed12f805ac5fc69765cf073c0789f2fa0beb45a2889954fea80e3502712fc7/gutenberg_27045.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a6cb4e9c754924333e37dde766098f862ddd079c81009c77454f377c96b9ac19
+size 84138
temp/gradio/2eccd11097e5d097d30fbc0bb75385de2026208e447113380e02d9d664265e62/ligurian_2.mp3
ADDED
Binary file (40.1 kB).
temp/gradio/7f3cc334a792175a240a628a2882c1a78e68a77954fbde6a2eaa56045701acac/english.mp3
ADDED
Binary file (21.6 kB).
temp/gradio/87da63f0aabe1f9ef4833ae4548c4c073cbb5efc691051e498b29c51b5e547ca/ligurian_1.mp3
ADDED
Binary file (38.5 kB).
temp/gradio/9434d7872cbe7d4964ebf21282f60e80d5b0724e46ae4f4244607397cf3f2954/zenamt_10k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f2ea0347d01ca1b28f9774150f736a548b491e52c8fd41b365c79567e4178f69
+size 1378407
temp/gradio/9f0eda2f99e1153e171d6bf42aa23964e84c9324b01443f7d50c9f809e915107/ngen_he_whistles_often.wav
ADDED
Binary file (631 kB).
temp/gradio/a41ab02f4a892bb3192c2659932fda1ac6b2d4a4f132906fbd93b664e8175ea5/zenamt_5k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:10de84604206cf2383ee5bd31e0a7e64e489f0c805bd04a73d3764febe2a787f
+size 689082
temp/gradio/c96725fec391b6a10de94bfab9f87e2ff33d2c4e39353cd2bc0e127bb23f0f0f/Ngen_lexicon_16112024.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bdc687a2c66dc8011313f79d95a9f6eae50d95d20bc736b8d0308b15846348bc
+size 29738
temp/gradio/e19724c317effb656ec46271bcf29a7643dab34b063649fa743a50ed2b0febb6/ligurian_3.mp3
ADDED
Binary file (35.3 kB).
temp/gradio/eeb33b271d6ae4dd706c5d41e191181798b08290b7192db7ad689c7e898ed442/c4_10k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0913dc53e7125db53cef2830733e021ac9020353a2227e9aa1974fc91f81349e
+size 20455097
temp/gradio/tmpq09r5cf8
ADDED
File without changes
upload/english/c4_10k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0913dc53e7125db53cef2830733e021ac9020353a2227e9aa1974fc91f81349e
+size 20455097
upload/english/c4_25k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:922e679e338b153a2208c6a45cf7b4c6558ea08f2a847d2d1b41ed24a6fc0bfe
+size 51696380
upload/english/c4_5k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8d869b186c5185702420066707c4b05665eacf0b85fe6ac2d6723973f23e21d0
+size 10581246
upload/english/cv8_top10k_words.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:469096ea6dade7e1f902020c07ae6bf9ab3b7bff590422de63ee94050010d2d1
+size 78712
upload/english/english.mp3
ADDED
Binary file (21.6 kB).
upload/english/english_full.mp3
ADDED
Binary file (40.5 kB).
upload/english/gutenberg_27045.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a6cb4e9c754924333e37dde766098f862ddd079c81009c77454f377c96b9ac19
+size 84138
upload/english/ngen_lexicon_jan_2025.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:35ab64d73c5443317966308fdb73e351c3ccedf3c8612095582a6c1d47bbf664
+size 13267
upload/ligurian/1.m4a
ADDED
Binary file (232 kB).
upload/ligurian/2.m4a
ADDED
Binary file (228 kB).
upload/ligurian/3.m4a
ADDED
Binary file (205 kB).
upload/ligurian/ligurian_1.mp3
ADDED
Binary file (38.5 kB).
upload/ligurian/ligurian_1.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3ec314577cfa5f769492bf3fc0eca1fb688eed558862e32ea739f4d53d6c7c5d
+size 106
upload/ligurian/ligurian_2.mp3
ADDED
Binary file (40.1 kB).
upload/ligurian/ligurian_2.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e16eb241f4d90bb863f9bdad39dd707e1ff163826d67c6dc09f5db487af79f88
+size 112
upload/ligurian/ligurian_3.mp3
ADDED
Binary file (35.3 kB).
upload/ligurian/ligurian_3.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4ee9a91afe4800abcd67f7b66f8ac0bcfcf3f1525e2d76e5a69a31ef769ec907
+size 92
upload/ligurian/zenamt_10k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f2ea0347d01ca1b28f9774150f736a548b491e52c8fd41b365c79567e4178f69
+size 1378407
upload/ligurian/zenamt_5k_sentences.txt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:10de84604206cf2383ee5bd31e0a7e64e489f0c805bd04a73d3764febe2a787f
+size 689082
upload/mms_zs/config.json
ADDED
@@ -0,0 +1,108 @@
+{
+  "activation_dropout": 0.0,
+  "adapter_attn_dim": null,
+  "adapter_kernel_size": 3,
+  "adapter_stride": 2,
+  "add_adapter": false,
+  "apply_spec_augment": true,
+  "architectures": [
+    "Wav2Vec2ForCTC"
+  ],
+  "attention_dropout": 0.1,
+  "bos_token_id": 1,
+  "classifier_proj_size": 256,
+  "codevector_dim": 768,
+  "contrastive_logits_temperature": 0.1,
+  "conv_bias": true,
+  "conv_dim": [
+    512,
+    512,
+    512,
+    512,
+    512,
+    512,
+    512
+  ],
+  "conv_kernel": [
+    10,
+    3,
+    3,
+    3,
+    3,
+    2,
+    2
+  ],
+  "conv_stride": [
+    5,
+    2,
+    2,
+    2,
+    2,
+    2,
+    2
+  ],
+  "ctc_loss_reduction": "sum",
+  "ctc_zero_infinity": false,
+  "diversity_loss_weight": 0.1,
+  "do_stable_layer_norm": true,
+  "eos_token_id": 2,
+  "feat_extract_activation": "gelu",
+  "feat_extract_dropout": 0.0,
+  "feat_extract_norm": "layer",
+  "feat_proj_dropout": 0.1,
+  "feat_quantizer_dropout": 0.0,
+  "final_dropout": 0.0,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout": 0.1,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "layerdrop": 0.1,
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_prob": 0.075,
+  "model_type": "wav2vec2",
+  "num_adapter_layers": 3,
+  "num_attention_heads": 16,
+  "num_codevector_groups": 2,
+  "num_codevectors_per_group": 320,
+  "num_conv_pos_embedding_groups": 16,
+  "num_conv_pos_embeddings": 128,
+  "num_feat_extract_layers": 7,
+  "num_hidden_layers": 24,
+  "num_negatives": 100,
+  "output_hidden_size": 1024,
+  "pad_token_id": 0,
+  "proj_codevector_dim": 768,
+  "tdnn_dilation": [
+    1,
+    2,
+    3,
+    1,
+    1
+  ],
+  "tdnn_dim": [
+    512,
+    512,
+    512,
+    512,
+    1500
+  ],
+  "tdnn_kernel": [
+    5,
+    3,
+    3,
+    1,
+    1
+  ],
+  "torch_dtype": "float32",
+  "transformers_version": "4.42.1",
+  "use_weighted_layer_sum": false,
+  "vocab_size": 32,
+  "xvector_output_dim": 512
+}
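The `conv_kernel` and `conv_stride` lists in config.json determine how many feature frames the encoder produces for a given number of audio samples. A minimal sketch of that arithmetic, assuming unpadded 1-D convolutions (output length `floor((n - kernel) / stride) + 1` per layer), which is how the wav2vec 2.0 feature extractor is usually described; the function name `num_frames` is illustrative:

```python
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]  # "conv_kernel" from config.json
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]   # "conv_stride" from config.json


def num_frames(num_samples: int) -> int:
    """Number of frames after the convolutional feature extractor (no padding assumed)."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    # One second of audio at 16 kHz.
    print(num_frames(16000))  # 49
```

The strides multiply to 5 × 2⁶ = 320 samples, so each frame advances by about 20 ms at 16 kHz.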
upload/mms_zs/model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:39baa2c87b9abd9910c1982bf82aabda3dbe3ba615e20d5ee0be1026975dcb8c
+size 1261938632
upload/mms_zs/preprocessor_config.json
ADDED
@@ -0,0 +1,10 @@
+{
+  "do_normalize": true,
+  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0,
+  "processor_class": "Wav2Vec2Processor",
+  "return_attention_mask": true,
+  "sampling_rate": 16000
+}
upload/mms_zs/special_tokens_map.json
ADDED
@@ -0,0 +1,6 @@
|
1 |
+
{
|
2 |
+
"bos_token": "<s>",
|
3 |
+
"eos_token": "</s>",
|
4 |
+
"pad_token": "<pad>",
|
5 |
+
"unk_token": "<unk>"
|
6 |
+
}
|
upload/mms_zs/tokenizer_config.json
ADDED
@@ -0,0 +1,48 @@
|
1 |
+
{
|
2 |
+
"added_tokens_decoder": {
|
3 |
+
"0": {
|
4 |
+
"content": "<pad>",
|
5 |
+
"lstrip": true,
|
6 |
+
"normalized": false,
|
7 |
+
"rstrip": true,
|
8 |
+
"single_word": false,
|
9 |
+
"special": false
|
10 |
+
},
|
11 |
+
"1": {
|
12 |
+
"content": "<s>",
|
13 |
+
"lstrip": true,
|
14 |
+
"normalized": false,
|
15 |
+
"rstrip": true,
|
16 |
+
"single_word": false,
|
17 |
+
"special": false
|
18 |
+
},
|
19 |
+
"2": {
|
20 |
+
"content": "</s>",
|
21 |
+
"lstrip": true,
|
22 |
+
"normalized": false,
|
23 |
+
"rstrip": true,
|
24 |
+
"single_word": false,
|
25 |
+
"special": false
|
26 |
+
},
|
27 |
+
"3": {
|
28 |
+
"content": "<unk>",
|
29 |
+
"lstrip": true,
|
30 |
+
"normalized": false,
|
31 |
+
"rstrip": true,
|
32 |
+
"single_word": false,
|
33 |
+
"special": false
|
34 |
+
}
|
35 |
+
},
|
36 |
+
"bos_token": "<s>",
|
37 |
+
"clean_up_tokenization_spaces": true,
|
38 |
+
"do_lower_case": false,
|
39 |
+
"eos_token": "</s>",
|
40 |
+
"model_max_length": 1000000000000000019884624838656,
|
41 |
+
"pad_token": "<pad>",
|
42 |
+
"processor_class": "Wav2Vec2Processor",
|
43 |
+
"replace_word_delimiter_char": " ",
|
44 |
+
"target_lang": null,
|
45 |
+
"tokenizer_class": "Wav2Vec2CTCTokenizer",
|
46 |
+
"unk_token": "<unk>",
|
47 |
+
"word_delimiter_token": "|"
|
48 |
+
}
|
upload/mms_zs/tokens.txt
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:4ad0f19332374147fcd6d270ab16330b33f81c1650ea3356db0500f47e720281
|
3 |
+
size 77
|
upload/mms_zs/vocab.json
ADDED
@@ -0,0 +1,34 @@
|
1 |
+
{
|
2 |
+
"'": 26,
|
3 |
+
"</s>": 2,
|
4 |
+
"<pad>": 0,
|
5 |
+
"<s>": 1,
|
6 |
+
"<unk>": 3,
|
7 |
+
"a": 5,
|
8 |
+
"b": 21,
|
9 |
+
"c": 23,
|
10 |
+
"d": 19,
|
11 |
+
"e": 7,
|
12 |
+
"f": 29,
|
13 |
+
"g": 18,
|
14 |
+
"h": 17,
|
15 |
+
"i": 6,
|
16 |
+
"j": 25,
|
17 |
+
"k": 12,
|
18 |
+
"l": 16,
|
19 |
+
"m": 13,
|
20 |
+
"n": 8,
|
21 |
+
"o": 9,
|
22 |
+
"p": 22,
|
23 |
+
"q": 30,
|
24 |
+
"r": 15,
|
25 |
+
"s": 14,
|
26 |
+
"t": 11,
|
27 |
+
"u": 10,
|
28 |
+
"v": 27,
|
29 |
+
"w": 24,
|
30 |
+
"x": 31,
|
31 |
+
"y": 20,
|
32 |
+
"z": 28,
|
33 |
+
"|": 4
|
34 |
+
}
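As a rough illustration of how a character vocabulary like this is typically used, the sketch below does greedy CTC decoding over a sequence of token ids: consecutive repeats are collapsed, special tokens are dropped, and the word delimiter `|` becomes a space. It assumes that `<pad>` (id 0) doubles as the CTC blank, which is the usual convention for `Wav2Vec2CTCTokenizer` but is not stated in this diff. The function name `ctc_greedy_decode` and the sample ids are hypothetical.

```python
# Token-to-id table copied from vocab.json above.
VOCAB = {
    "<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4,
    "a": 5, "i": 6, "e": 7, "n": 8, "o": 9, "u": 10, "t": 11,
    "k": 12, "m": 13, "s": 14, "r": 15, "l": 16, "h": 17, "g": 18,
    "d": 19, "y": 20, "b": 21, "p": 22, "c": 23, "w": 24, "j": 25,
    "'": 26, "v": 27, "z": 28, "f": 29, "q": 30, "x": 31,
}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}
# Dropping all four special tokens is a simplification for this sketch.
SPECIAL = {"<pad>", "<s>", "</s>", "<unk>"}


def ctc_greedy_decode(ids):
    """Collapse repeated ids, drop special tokens, map '|' to a space."""
    chars = []
    prev = None
    for idx in ids:
        if idx != prev:  # keep only the first id of each run
            tok = ID_TO_TOKEN[idx]
            if tok not in SPECIAL:
                chars.append(" " if tok == "|" else tok)
        prev = idx
    return "".join(chars).strip()
```

Because the blank separates runs, a doubled letter survives only when a `<pad>` sits between its two occurrences: `[16, 0, 16]` decodes to `ll`, while `[16, 16]` decodes to `l`.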
|
uroman/.gitignore
ADDED
@@ -0,0 +1,35 @@
|
1 |
+
!Build/
|
2 |
+
.last_cover_stats
|
3 |
+
/META.yml
|
4 |
+
/META.json
|
5 |
+
/MYMETA.*
|
6 |
+
*.o
|
7 |
+
*.pm.tdy
|
8 |
+
*.bs
|
9 |
+
|
10 |
+
# Devel::Cover
|
11 |
+
cover_db/
|
12 |
+
|
13 |
+
# Devel::NYTProf
|
14 |
+
nytprof.out
|
15 |
+
|
16 |
+
# Dist::Zilla
|
17 |
+
/.build/
|
18 |
+
|
19 |
+
# Module::Build
|
20 |
+
_build/
|
21 |
+
Build
|
22 |
+
Build.bat
|
23 |
+
|
24 |
+
# Module::Install
|
25 |
+
inc/
|
26 |
+
|
27 |
+
# ExtUtils::MakeMaker
|
28 |
+
/blib/
|
29 |
+
/_eumm/
|
30 |
+
/*.gz
|
31 |
+
/Makefile
|
32 |
+
/Makefile.old
|
33 |
+
/MANIFEST.bak
|
34 |
+
/pm_to_blib
|
35 |
+
/*.zip
|
uroman/LICENSE.txt
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:94d6ebab108e9e78076f28a956399effb6227658440c8ceeac1dede813431c18
|
3 |
+
size 1533
|
uroman/README.md
ADDED
@@ -0,0 +1,165 @@
|
1 |
+
# uroman
|
2 |
+
|
3 |
+
*uroman* is a *universal romanizer*. It converts text in any script to the Latin alphabet.
|
4 |
+
|
5 |
+
Version: 1.2.8
|
6 |
+
Release date: April 23, 2021
|
7 |
+
Author: Ulf Hermjakob, USC Information Sciences Institute
|
8 |
+
|
9 |
+
|
10 |
+
### Usage
|
11 |
+
```bash
|
12 |
+
$ uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN
|
13 |
+
where the optional <lang-code> is a 3-letter language code, e.g. ara, bel, bul, deu, ell, eng, fas,
|
14 |
+
grc, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid.
|
15 |
+
--chart specifies chart output (in JSON format) to represent alternative romanizations.
|
16 |
+
--no-cache disables caching.
|
17 |
+
```
|
18 |
+
### Examples
|
19 |
+
```bash
|
20 |
+
$ bin/uroman.pl < text/zho.txt
|
21 |
+
$ bin/uroman.pl -l tur < text/tur.txt
|
22 |
+
$ bin/uroman.pl -l heb --chart < text/heb.txt
|
23 |
+
$ bin/uroman.pl < test/multi-script.txt > test/multi-script.uroman.txt
|
24 |
+
```
|
25 |
+
|
26 |
+
Identifying the input as Arabic, Belarusian, Bulgarian, English, Farsi, German,
|
27 |
+
Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian,
|
28 |
+
Lithuanian, North Macedonian, Russian, Serbian, Turkish, Ukrainian, Uyghur or
|
29 |
+
Yiddish will improve romanization for those languages as some letters in those
|
30 |
+
languages have different sound values from other languages using the same script
|
31 |
+
(French, Russian, Hebrew respectively).
|
32 |
+
No effect for other languages in this version.
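
The command-line usage above could be wrapped from Python roughly as follows. This is a minimal sketch, not part of uroman itself: the helper names `build_uroman_command` and `romanize`, the default script path, and the choice to launch the script through `perl` are all assumptions. Only the `-l`, `--chart` and `--no-cache` options come from the usage text.

```python
import subprocess


def build_uroman_command(script="bin/uroman.pl", lang=None,
                         chart=False, no_cache=False):
    """Build the argument list for one uroman.pl call (hypothetical helper)."""
    cmd = ["perl", script]
    if lang:
        cmd += ["-l", lang]  # optional language code, e.g. "tur"
    if chart:
        cmd.append("--chart")  # JSON chart of alternative romanizations
    if no_cache:
        cmd.append("--no-cache")
    return cmd


def romanize(text, **options):
    """Pipe `text` to uroman.pl on STDIN and return its STDOUT."""
    result = subprocess.run(
        build_uroman_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if the script exits with an error
    )
    return result.stdout
```

For example, `romanize(text, lang="heb", chart=True)` would correspond to the third command in the examples above, provided Perl and the script are available at the assumed path.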
|
33 |
+
|
34 |
+
### Bibliography
|
35 |
+
Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Demo Track. ACL-2018 Best Demo Paper Award. [Paper in ACL Anthology](https://www.aclweb.org/anthology/P18-4003) | [Poster](https://www.isi.edu/~ulf/papers/poster-uroman-acl2018.pdf) | [BibTex](https://www.aclweb.org/anthology/P18-4003.bib)
|
36 |
+
|
37 |
+
### Change History
|
38 |
+
Changes in version 1.2.8
|
39 |
+
* Updated to Unicode 13.0 (2021), which supports several new scripts (10% larger UnicodeData.txt).
|
40 |
+
* Improved support for Georgian.
|
41 |
+
* Preserve various symbols (as opposed to mapping to the symbols' names).
|
42 |
+
* Various small improvements.
|
43 |
+
|
44 |
+
Changes in version 1.2.7
|
45 |
+
* Improved support for Pashto.
|
46 |
+
|
47 |
+
Changes in version 1.2.6
|
48 |
+
* Improved support for Ukrainian, Russian and Ogham (ancient Irish script).
|
49 |
+
* Added support for English Braille.
|
50 |
+
* Added alternative Romanization for North Macedonian and Serbian (mkd2/srp2)
|
51 |
+
reflecting a casual style that many native speakers of those languages use
|
52 |
+
when writing text in Latin script, e.g. non-accented single letters (e.g. "s")
|
53 |
+
rather than phonetically motivated combinations of letters (e.g. "sh").
|
54 |
+
* When a line starts with "::lcode xyz ", the new uroman version will switch to
|
55 |
+
that language for that line. This is used for the new reference test file.
|
56 |
+
* Various small improvements.
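
The per-line `::lcode` directive described above lends itself to building mixed-language input files. The sketch below prefixes each line with its language code; the helper names `tag_line` and `build_mixed_input` are hypothetical, and whether uroman keeps or strips the prefix in its output is not stated here.

```python
def tag_line(lang_code, text):
    """Prefix one line with the "::lcode xyz " directive."""
    return f"::lcode {lang_code} {text}"


def build_mixed_input(pairs):
    """Join (lang_code, text) pairs into newline-separated uroman input."""
    return "\n".join(tag_line(code, text) for code, text in pairs) + "\n"
```

For instance, `build_mixed_input([("tur", "Merhaba"), ("rus", "Привет")])` produces two lines, each of which uroman would romanize with its own language setting.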
|
57 |
+
|
58 |
+
Changes in version 1.2.5
|
59 |
+
* Improved support for Armenian and eight languages using Cyrillic scripts.
|
60 |
+
-- For Serbian and Macedonian, which are often written in both Cyrillic
|
61 |
+
and Latin scripts, uroman will map both official versions to the same
|
62 |
+
romanized text, e.g. both "Ниш" and "Niš" will be mapped to "Nish" (which
|
63 |
+
properly reflects the pronunciation of the city's name).
|
64 |
+
For both Serbian and Macedonian, casual writers often use a simplified
|
65 |
+
Latin form without diacritics, e.g. "s" to represent not only Cyrillic "с"
|
66 |
+
and Latin "s", but also "ш" or "š", even if this conflates "s" and "sh" and
|
67 |
+
other such pairs. The casual romanization can be simulated by using
|
68 |
+
alternative uroman language codes "srp2" and "mkd2", which romanize
|
69 |
+
both "Ниш" and "Niš" to "Nis" to reflect the casual Latin spelling.
|
70 |
+
* Various small improvements.
|
71 |
+
|
72 |
+
Changes in version 1.2.4
|
73 |
+
* Fixed a bug that generated two empty lines for each empty line in cache mode.
|
74 |
+
|
75 |
+
Changes in version 1.2
|
76 |
+
* Run-time improvement based on (1) token-based caching and (2) shortcut
|
77 |
+
romanization (identity) of ASCII strings for default 1-best (non-chart)
|
78 |
+
output. Speed-up by a factor of 10 for Bengali and Uyghur on medium and
|
79 |
+
large size texts.
|
80 |
+
* Incremental improvements for Farsi, Amharic, Russian, Hebrew and related
|
81 |
+
languages.
|
82 |
+
* Richer lattice structure (more alternatives) for "Romanization" of English
|
83 |
+
to support better matching to romanizations of other languages.
|
84 |
+
Changes output only when --chart option is specified. No change in output for
|
85 |
+
default 1-best output, which for ASCII characters is always the input string.
|
86 |
+
|
87 |
+
Changes in version 1.1 (major upgrade)
|
88 |
+
* Offers chart output (in JSON format) to represent alternative romanizations.
|
89 |
+
-- Location of first character is defined to be "line: 1, start:0, end:0".
|
90 |
+
* Incremental improvements of Hebrew and Greek romanization; Chinese numbers.
|
91 |
+
* Improved web-interface at http://www.isi.edu/~ulf/uroman.html
|
92 |
+
-- Shows corresponding original and romanization text in red
|
93 |
+
when hovering over a text segment.
|
94 |
+
-- Shows alternative romanizations when hovering over romanized text
|
95 |
+
marked by dotted underline.
|
96 |
+
-- Added right-to-left script detection and improved display for right-to-left
|
97 |
+
script text (as determined line by line).
|
98 |
+
-- On-page support for some scripts that are often not pre-installed on users'
|
99 |
+
computers (Burmese, Egyptian, Klingon).
|
100 |
+
|
101 |
+
Changes in version 1.0 (major upgrade)
|
102 |
+
* Upgraded principal internal data structure from string to lattice.
|
103 |
+
* Improvements mostly in vowelization of South and Southeast Asian languages.
|
104 |
+
* Vocalic 'r' more consistently treated as vowel (no additional vowel added).
|
105 |
+
* Repetition signs (Japanese/Chinese/Thai/Khmer/Lao) are mapped to superscript 2.
|
106 |
+
* Japanese Katakana middle dots now mapped to ASCII space.
|
107 |
+
* Tibetan intersyllabic mark now mapped to middle dot (U+00B7).
|
108 |
+
* Some corrections regarding analysis of Chinese numbers.
|
109 |
+
* Many more foreign diacritics and punctuation marks dropped or mapped to ASCII.
|
110 |
+
* Zero-width characters dropped, except line/sentence-initial byte order marks.
|
111 |
+
* Spaces normalized to ASCII space.
|
112 |
+
* Fixed bug that in some cases mapped signs (such as dagger or bullet) to their verbal descriptions.
|
113 |
+
* Tested against previous version of uroman with a new uroman visual diff tool.
|
114 |
+
* Almost an order of magnitude faster.
|
115 |
+
|
116 |
+
Changes in version 0.7 (minor upgrade)
|
117 |
+
* Added script uroman-quick.pl for Arabic script languages, incl. Uyghur.
|
118 |
+
Much faster, pre-caching mapping of Arabic to Latin characters, simple greedy processing.
|
119 |
+
Will not convert material from non-Arabic blocks such as any (somewhat unusual) Cyrillic
|
120 |
+
or Chinese characters in Uyghur texts.
|
121 |
+
|
122 |
+
Changes in version 0.6 (minor upgrade)
|
123 |
+
* Added support for two letter characters used in Uzbek:
|
124 |
+
(1) character "ʻ" ("modifier letter turned comma", which modifies preceding "g" and "u" letters)
|
125 |
+
(2) character "ʼ" ("modifier letter apostrophe", which Uzbek uses to mark a glottal stop).
|
126 |
+
Both are now mapped to "'" (plain ASCII apostrophe).
|
127 |
+
* Added support for Uyghur vowel characters such as "ې" (Arabic e) and "ۆ" (Arabic oe)
|
128 |
+
even when they are not preceded by "ئ" (yeh with hamza above).
|
129 |
+
* Added support for Arabic semicolon "؛", Arabic ligature forms for phrases such as "ﷺ"
|
130 |
+
("sallallahou alayhe wasallam" = "prayer of God be upon him and his family and peace")
|
131 |
+
* Added robustness for Arabic letter presentation forms (initial/medial/final/isolated).
|
132 |
+
However, it is strongly recommended to normalize any presentation form Arabic letters
|
133 |
+
to their non-presentation form before calling uroman.
|
134 |
+
* Added force flush directive ($|=1;).
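
One way to follow the recommendation above about Arabic presentation forms is to apply Unicode compatibility normalization before calling uroman. The sketch below uses Python's standard `unicodedata` module and applies NFKC only to characters in two code point ranges of positional letter forms. The function name and the exact ranges are assumptions for this example, and this is not necessarily the normalization the uroman author had in mind.

```python
import unicodedata

# Assumed ranges: positional letter forms in Arabic Presentation Forms-A
# and the Arabic Presentation Forms-B block (up to the lam-alef ligatures).
PRESENTATION_RANGES = ((0xFB50, 0xFBB1), (0xFE70, 0xFEFC))


def normalize_presentation_forms(text):
    """Map Arabic presentation-form letters to their base letters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES):
            # NFKC replaces the positional form with the plain letter(s).
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

Restricting the ranges leaves word ligatures such as "ﷺ" (U+FDFA) untouched, so they still reach uroman in their original form.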
|
135 |
+
|
136 |
+
Changes in version 0.5 (minor upgrade)
|
137 |
+
* Improvements for Uyghur (make sure to use language option: -l uig)
|
138 |
+
|
139 |
+
Changes in version 0.4 (minor upgrade)
|
140 |
+
* Improvements for Thai (special cases for vowel/consonant reordering, e.g. for "sara o"; dropped some aspiration 'h's)
|
141 |
+
* Minor change for Arabic (added "alef+fathatan" = "an")
|
142 |
+
|
143 |
+
New features in version 0.3
|
144 |
+
* Covers Mandarin (Chinese)
|
145 |
+
* Improved romanization for numerous languages
|
146 |
+
* Preserves capitalization (e.g. from Latin, Cyrillic, Greek scripts)
|
147 |
+
* Maps from native digits to Western numbers
|
148 |
+
* Faster for South Asian languages
|
149 |
+
|
150 |
+
### Other features
|
151 |
+
* Web interface: http://www.isi.edu/~ulf/uroman.html
|
152 |
+
* Vowelization is provided when locally computable, e.g. for many South Asian languages and Tibetan.
|
153 |
+
|
154 |
+
### Limitations
|
155 |
+
* The current version of uroman has a few limitations, some of which we plan to address in future versions.
|
156 |
+
For Japanese, *uroman* currently romanizes hiragana and katakana as expected, but kanji are interpreted as Chinese characters and romanized as such.
|
157 |
+
For Egyptian hieroglyphs, only single-sound phonetic characters and numbers are currently romanized.
|
158 |
+
For Linear B, only phonetic syllabic characters are romanized.
|
159 |
+
For some other extinct scripts such as cuneiform, no romanization is provided.
|
160 |
+
* A romanizer is not a full transliterator. For example, this version of
|
161 |
+
uroman does not vowelize text that lacks explicit vowelization such as
|
162 |
+
normal text in Arabic and Hebrew (without diacritics/points).
|
163 |
+
|
164 |
+
### Acknowledgments
|
165 |
+
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116, and by research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, Air Force Laboratory, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
|