Updates README with discussion
README.md
CHANGED
@@ -224,3 +224,343 @@ The following hyperparameters were used during training:
|
|
224 |
- Pytorch 2.0.1+cu118
|
225 |
- Datasets 2.12.0
|
226 |
- Tokenizers 0.13.3
227 |
+
|
228 |
+
## Discussion
|
229 |
+
|
230 |
+
### NaNs and Infs
|
231 |
+
|
232 |
+
While debugging other training sessions that used more data from the Esperanto Common Voice dataset, in which some loss calculations returned either `inf` or `nan`, I found that some training-set examples had a surprisingly high CER (character error rate) when transcribed by this model. Some examples:
|
233 |
+
|
234 |
+
| file | Actual<br>Predicted | CER | Comment |
|
235 |
+
|:-----|:--------------------|:----|:--------|
|
236 |
+
|common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj<br>a taaj keo eoj eejn kigos eegoj eioeegiooj| 0.61 | No audio |
|
237 |
+
|common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon<br>ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
|
238 |
+
|common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon<br>iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
|
239 |
+
|2600 | ili akiras plenkreskan plumaron nur en la kvina jaro<br>ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
|
240 |
+
|7333 | poste sekvas difinoj de la termino<br>po | 0.94 | No audio |
|
241 |
+
|7334 | li gvidis multajn kursojn laŭ la csehmetodo<br>po | 0.98 | No audio |
|
242 |
+
|7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete<br>po | 0.97 | No audio |
|
243 |
+
|11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj<br>linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |
|
244 |
+
|
245 |
+
Some examples have no audio. All of these files in the dataset are completely useless, and should be removed from the training set.
|
246 |
+
|
247 |
+
You can see that the model is trying to hallucinate the target when there's little or no audio. This is terrible for realistically reporting what was said. I'd also hoped that there would be some measure of certainty, so that only transcriptions with relatively high certainty are kept. However, I haven't found how to get at a certainty value.
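One heuristic that might serve as a rough certainty value is the average winning probability of a greedy CTC decode. The sketch below is an assumption on my part, not something the training script provides: it takes per-frame logits as plain nested lists (with a real model these would come from the CTC head's `logits` output), assumes the blank token has id 0, and returns the geometric mean of the per-frame maximum softmax probability. This is not a calibrated probability, only a score for ranking transcriptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one frame of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_ctc_confidence(logits, blank_id=0):
    """Greedy CTC decode that also returns a heuristic confidence.

    logits: list of frames, each a list of per-token scores.
    blank_id: id of the CTC blank token (assumed to be 0 here).
    Returns (token_ids, confidence), where confidence is the geometric
    mean of the winning probability over all frames.
    """
    token_ids = []
    log_prob_sum = 0.0
    previous = None
    for frame in logits:
        probs = softmax(frame)
        best = max(range(len(probs)), key=probs.__getitem__)
        log_prob_sum += math.log(probs[best])
        # CTC collapse: drop repeated ids and blanks.
        if best != previous and best != blank_id:
            token_ids.append(best)
        previous = best
    if not logits:
        return token_ids, 1.0
    return token_ids, math.exp(log_prob_sum / len(logits))
```

Transcriptions whose score falls below some cutoff could then be flagged for manual review; a sensible cutoff would have to be chosen by looking at known-bad files such as the ones in the table above.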
|
248 |
+
|
249 |
+
The Common Voice dataset also contains upvotes and downvotes. Of the high CER sentences above, all had 2 upvotes, with some having 0 downvotes, and some having 1. So we cannot rely on upvotes or downvotes to detect quality.
|
250 |
+
|
251 |
+
So what to do?
|
252 |
+
|
253 |
+
### Alternative 1
|
254 |
+
|
255 |
+
Despite these zero- and low-quality files, training seems to work OK. However, we still need to handle the cases where the loss becomes `nan` or `inf`, because a single such value ruins the mean loss.
|
256 |
+
|
257 |
+
By running `run_speech_recognition_ctc` with `do_train=false`, setting `model_name_or_path="xekri/wav2vec2-common_voice_13_0-eo-3"`, setting `eval_split_name` to either `test`, `validation`, or `train`, and also modifying `trainer.py` as follows, I can check if any losses are nan or inf:
|
258 |
+
|
259 |
+
```py
|
260 |
+
# To be JSON-serializable, we need to remove numpy types or zero-d tensors
metrics = denumpify_detensorize(metrics)

if all_losses is not None:
    # np.where() returns a tuple of index arrays, so len() of its result is
    # never 0. np.flatnonzero() gives the indices directly, so check .size.
    loss_nan = np.flatnonzero(np.isnan(all_losses))
    if loss_nan.size != 0:
        print(f'LOSSES ARE NAN: {loss_nan}')
    loss_inf = np.flatnonzero(np.isinf(all_losses))
    if loss_inf.size != 0:
        print(f'LOSSES ARE INF: {loss_inf}')
    metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
|
271 |
+
```
|
272 |
+
|
273 |
+
Doing this shows that of the 14913 examples in `test`, the following file results in `inf` loss:
|
274 |
+
|
275 |
+
`common_voice_eo_25167318.mp3`
|
276 |
+
|
277 |
+
The audio in this file is severely garbled. It should absolutely be filtered out of the test set.
|
278 |
+
|
279 |
+
No `validation` samples result in `inf` or `nan`.
|
280 |
+
|
281 |
+
The following files out of the 143984 examples in `train` result in `inf` loss:
|
282 |
+
|
283 |
+
```txt
|
284 |
+
common_voice_eo_25467641.mp3
|
285 |
+
common_voice_eo_25467723.mp3
|
286 |
+
common_voice_eo_25467791.mp3
|
287 |
+
common_voice_eo_25467820.mp3
|
288 |
+
common_voice_eo_25467943.mp3
|
289 |
+
common_voice_eo_25478612.mp3
|
290 |
+
common_voice_eo_25478623.mp3
|
291 |
+
common_voice_eo_25478631.mp3
|
292 |
+
common_voice_eo_25478756.mp3
|
293 |
+
common_voice_eo_25478762.mp3
|
294 |
+
common_voice_eo_25478768.mp3
|
295 |
+
common_voice_eo_25478769.mp3
|
296 |
+
common_voice_eo_25479150.mp3
|
297 |
+
common_voice_eo_25479203.mp3
|
298 |
+
common_voice_eo_25479229.mp3
|
299 |
+
common_voice_eo_25517673.mp3
|
300 |
+
common_voice_eo_25517677.mp3
|
301 |
+
common_voice_eo_25527739.mp3
|
302 |
+
```
|
303 |
+
|
304 |
+
Those files have no audio.
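Turning the reported indices into file names, and then dropping those files, can be sketched as follows. This is a minimal illustration under assumptions: it supposes one loss value per example in dataset order (for instance, an evaluation batch size of 1), and it represents examples as plain dicts with a `path` key. The helper names `nonfinite_indices` and `drop_files` are hypothetical.

```python
import math
import os


def nonfinite_indices(losses):
    """Indices of losses that are NaN or +/-inf."""
    return [i for i, loss in enumerate(losses) if not math.isfinite(loss)]


def drop_files(examples, bad_names):
    """Keep only examples whose file name is not in bad_names."""
    bad = set(bad_names)
    return [ex for ex in examples if os.path.basename(ex["path"]) not in bad]


# Toy data standing in for a dataset split and its per-example losses.
examples = [
    {"path": "clips/common_voice_eo_1.mp3"},
    {"path": "clips/common_voice_eo_2.mp3"},
    {"path": "clips/common_voice_eo_3.mp3"},
]
losses = [0.42, float("inf"), 0.37]

bad = [os.path.basename(examples[i]["path"]) for i in nonfinite_indices(losses)]
cleaned = drop_files(examples, bad)
```

With the `datasets` library, the same predicate could presumably be passed to `Dataset.filter` instead of using a list comprehension.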
|
305 |
+
|
306 |
+
### Alternative 2
|
307 |
+
|
308 |
+
Another possibility is just to go through the audio files and throw away any where the peak audio isn't above some threshold.
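A minimal sketch of such a peak check, assuming the audio has already been decoded to floating-point samples in the range [-1, 1]. The threshold of 0.01 is an arbitrary placeholder, not a tuned value, and the function names are hypothetical.

```python
def peak_amplitude(samples):
    """Largest absolute sample value, or 0.0 for an empty clip."""
    return max((abs(s) for s in samples), default=0.0)


def is_probably_silent(samples, threshold=0.01):
    """True if the clip never rises above the (assumed) threshold."""
    return peak_amplitude(samples) < threshold
```

A peak check only catches silent files. Distorted recordings, or ones where the speaker says something other than the prompt, are loud enough to pass, so this would complement the other alternatives rather than replace them.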
|
309 |
+
|
310 |
+
### Alternative 3
|
311 |
+
|
312 |
+
Since this model seems to work well enough, I could run inference on all samples, and just discard the ones where the CER (as determined by this model) is too high, say above 0.5. Then use that to filter the examples and train another model. These high-CER examples are:
|
313 |
+
|
314 |
+
#### Test set
|
315 |
+
|
316 |
+
```txt
|
317 |
+
common_voice_eo_25214319.mp3
|
318 |
+
common_voice_eo_25006596.mp3
|
319 |
+
common_voice_eo_27472721.mp3
|
320 |
+
common_voice_eo_27715088.mp3
|
321 |
+
common_voice_eo_27715091.mp3
|
322 |
+
common_voice_eo_26677019.mp3
|
323 |
+
common_voice_eo_26677023.mp3
|
324 |
+
common_voice_eo_20555291.mp3
|
325 |
+
common_voice_eo_25001942.mp3
|
326 |
+
common_voice_eo_25457354.mp3
|
327 |
+
common_voice_eo_25457355.mp3
|
328 |
+
common_voice_eo_25457365.mp3
|
329 |
+
common_voice_eo_25457373.mp3
|
330 |
+
common_voice_eo_25457396.mp3
|
331 |
+
common_voice_eo_25457397.mp3
|
332 |
+
common_voice_eo_25457409.mp3
|
333 |
+
common_voice_eo_25457410.mp3
|
334 |
+
common_voice_eo_25457412.mp3
|
335 |
+
common_voice_eo_25457442.mp3
|
336 |
+
common_voice_eo_25457444.mp3
|
337 |
+
common_voice_eo_25457445.mp3
|
338 |
+
common_voice_eo_25457577.mp3
|
339 |
+
common_voice_eo_25457578.mp3
|
340 |
+
common_voice_eo_28064453.mp3
|
341 |
+
common_voice_eo_25047803.mp3
|
342 |
+
common_voice_eo_25048418.mp3
|
343 |
+
common_voice_eo_25048419.mp3
|
344 |
+
common_voice_eo_25048421.mp3
|
345 |
+
common_voice_eo_25048423.mp3
|
346 |
+
common_voice_eo_25048428.mp3
|
347 |
+
common_voice_eo_25048574.mp3
|
348 |
+
common_voice_eo_25885643.mp3
|
349 |
+
common_voice_eo_25885645.mp3
|
350 |
+
common_voice_eo_26794882.mp3
|
351 |
+
common_voice_eo_27356529.mp3
|
352 |
+
common_voice_eo_25012640.mp3
|
353 |
+
common_voice_eo_25303457.mp3
|
354 |
+
common_voice_eo_18153931.mp3
|
355 |
+
common_voice_eo_18776206.mp3
|
356 |
+
common_voice_eo_18776208.mp3
|
357 |
+
common_voice_eo_18776219.mp3
|
358 |
+
common_voice_eo_18776220.mp3
|
359 |
+
common_voice_eo_18776222.mp3
|
360 |
+
common_voice_eo_18776223.mp3
|
361 |
+
common_voice_eo_18776236.mp3
|
362 |
+
common_voice_eo_18776238.mp3
|
363 |
+
common_voice_eo_18776244.mp3
|
364 |
+
common_voice_eo_18776248.mp3
|
365 |
+
common_voice_eo_18776285.mp3
|
366 |
+
common_voice_eo_18776287.mp3
|
367 |
+
common_voice_eo_18776297.mp3
|
368 |
+
common_voice_eo_18776298.mp3
|
369 |
+
common_voice_eo_25047998.mp3
|
370 |
+
common_voice_eo_25047999.mp3
|
371 |
+
common_voice_eo_25048000.mp3
|
372 |
+
common_voice_eo_25048001.mp3
|
373 |
+
common_voice_eo_25048002.mp3
|
374 |
+
common_voice_eo_25053113.mp3
|
375 |
+
common_voice_eo_25068355.mp3
|
376 |
+
common_voice_eo_25333056.mp3
|
377 |
+
common_voice_eo_25371639.mp3
|
378 |
+
common_voice_eo_25371640.mp3
|
379 |
+
common_voice_eo_25371641.mp3
|
380 |
+
common_voice_eo_25371642.mp3
|
381 |
+
common_voice_eo_25371643.mp3
|
382 |
+
common_voice_eo_22441946.mp3
|
383 |
+
common_voice_eo_26622121.mp3
|
384 |
+
common_voice_eo_25167318.mp3
|
385 |
+
common_voice_eo_25252685.mp3
|
386 |
+
common_voice_eo_25252698.mp3
|
387 |
+
common_voice_eo_25518636.mp3
|
388 |
+
```
|
389 |
+
|
390 |
+
Note on `test[100]` and `test[101]`: We know that `saluton kiel vi fartas` and `atendu momenton` are a good start, but if that's not the text you were asked to record, you're not really helping.
|
391 |
+
|
392 |
+
#### Validation set
|
393 |
+
|
394 |
+
141 of
|
395 |
+
```txt
|
396 |
+
common_voice_eo_25392669.mp3
|
397 |
+
common_voice_eo_25392674.mp3
|
398 |
+
common_voice_eo_25392675.mp3
|
399 |
+
common_voice_eo_25392676.mp3
|
400 |
+
common_voice_eo_25392678.mp3
|
401 |
+
common_voice_eo_25392693.mp3
|
402 |
+
common_voice_eo_25392694.mp3
|
403 |
+
common_voice_eo_25392695.mp3
|
404 |
+
common_voice_eo_25392697.mp3
|
405 |
+
common_voice_eo_25392701.mp3
|
406 |
+
common_voice_eo_25392702.mp3
|
407 |
+
common_voice_eo_25392708.mp3
|
408 |
+
common_voice_eo_25392709.mp3
|
409 |
+
common_voice_eo_25408881.mp3
|
410 |
+
common_voice_eo_25408882.mp3
|
411 |
+
common_voice_eo_25408885.mp3
|
412 |
+
common_voice_eo_27380623.mp3
|
413 |
+
```
|
414 |
+
|
415 |
+
I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.
|
416 |
+
|
417 |
+
|
418 |
+
#### Training set
|
419 |
+
|
420 |
+
135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.
|
421 |
+
|
422 |
+
```txt
|
423 |
+
common_voice_eo_25365027.mp3
|
424 |
+
common_voice_eo_25365472.mp3
|
425 |
+
common_voice_eo_25365480.mp3
|
426 |
+
common_voice_eo_25365532.mp3
|
427 |
+
common_voice_eo_25365695.mp3
|
428 |
+
common_voice_eo_25365744.mp3
|
429 |
+
common_voice_eo_25365804.mp3
|
430 |
+
common_voice_eo_25365836.mp3
|
431 |
+
common_voice_eo_25365855.mp3
|
432 |
+
common_voice_eo_25372587.mp3
|
433 |
+
common_voice_eo_25401060.mp3
|
434 |
+
common_voice_eo_25430837.mp3
|
435 |
+
common_voice_eo_25444509.mp3
|
436 |
+
common_voice_eo_25240777.mp3
|
437 |
+
common_voice_eo_24942754.mp3
|
438 |
+
common_voice_eo_24942755.mp3
|
439 |
+
common_voice_eo_24990372.mp3
|
440 |
+
common_voice_eo_24990385.mp3
|
441 |
+
common_voice_eo_24990390.mp3
|
442 |
+
common_voice_eo_24990397.mp3
|
443 |
+
common_voice_eo_24990413.mp3
|
444 |
+
common_voice_eo_24990427.mp3
|
445 |
+
common_voice_eo_24990429.mp3
|
446 |
+
common_voice_eo_24990435.mp3
|
447 |
+
common_voice_eo_24990441.mp3
|
448 |
+
common_voice_eo_24990454.mp3
|
449 |
+
common_voice_eo_24990457.mp3
|
450 |
+
common_voice_eo_24990459.mp3
|
451 |
+
common_voice_eo_24990490.mp3
|
452 |
+
common_voice_eo_25529345.mp3
|
453 |
+
common_voice_eo_25648750.mp3
|
454 |
+
common_voice_eo_28670472.mp3
|
455 |
+
common_voice_eo_27931966.mp3
|
456 |
+
common_voice_eo_28252265.mp3
|
457 |
+
common_voice_eo_25454951.mp3
|
458 |
+
common_voice_eo_25927616.mp3
|
459 |
+
common_voice_eo_25153203.mp3
|
460 |
+
common_voice_eo_25238543.mp3
|
461 |
+
common_voice_eo_25284237.mp3
|
462 |
+
common_voice_eo_25460131.mp3
|
463 |
+
common_voice_eo_25460185.mp3
|
464 |
+
common_voice_eo_25460186.mp3
|
465 |
+
common_voice_eo_25460188.mp3
|
466 |
+
common_voice_eo_25460189.mp3
|
467 |
+
common_voice_eo_25446723.mp3
|
468 |
+
common_voice_eo_26025150.mp3
|
469 |
+
common_voice_eo_26640189.mp3
|
470 |
+
common_voice_eo_26888468.mp3
|
471 |
+
common_voice_eo_24844824.mp3
|
472 |
+
common_voice_eo_25022506.mp3
|
473 |
+
common_voice_eo_25022507.mp3
|
474 |
+
common_voice_eo_25022516.mp3
|
475 |
+
common_voice_eo_25032858.mp3
|
476 |
+
common_voice_eo_25032859.mp3
|
477 |
+
common_voice_eo_25032865.mp3
|
478 |
+
common_voice_eo_25243988.mp3
|
479 |
+
common_voice_eo_25244009.mp3
|
480 |
+
common_voice_eo_25266094.mp3
|
481 |
+
common_voice_eo_25266141.mp3
|
482 |
+
common_voice_eo_25285278.mp3
|
483 |
+
common_voice_eo_25286768.mp3
|
484 |
+
common_voice_eo_25457171.mp3
|
485 |
+
common_voice_eo_25467641.mp3
|
486 |
+
common_voice_eo_25467723.mp3
|
487 |
+
common_voice_eo_25467791.mp3
|
488 |
+
common_voice_eo_25467820.mp3
|
489 |
+
common_voice_eo_25467943.mp3
|
490 |
+
common_voice_eo_25478612.mp3
|
491 |
+
common_voice_eo_25478623.mp3
|
492 |
+
common_voice_eo_25478631.mp3
|
493 |
+
common_voice_eo_25478756.mp3
|
494 |
+
common_voice_eo_25478762.mp3
|
495 |
+
common_voice_eo_25478768.mp3
|
496 |
+
common_voice_eo_25478769.mp3
|
497 |
+
common_voice_eo_25479150.mp3
|
498 |
+
common_voice_eo_25479203.mp3
|
499 |
+
common_voice_eo_25479229.mp3
|
500 |
+
common_voice_eo_25517673.mp3
|
501 |
+
common_voice_eo_25517677.mp3
|
502 |
+
common_voice_eo_25527739.mp3
|
503 |
+
common_voice_eo_25975149.mp3
|
504 |
+
common_voice_eo_26193748.mp3
|
505 |
+
common_voice_eo_28401039.mp3
|
506 |
+
common_voice_eo_28421315.mp3
|
507 |
+
common_voice_eo_28937347.mp3
|
508 |
+
common_voice_eo_24890414.mp3
|
509 |
+
common_voice_eo_25294479.mp3
|
510 |
+
common_voice_eo_25438966.mp3
|
511 |
+
common_voice_eo_28855568.mp3
|
512 |
+
common_voice_eo_29011007.mp3
|
513 |
+
common_voice_eo_24599888.mp3
|
514 |
+
common_voice_eo_26964252.mp3
|
515 |
+
common_voice_eo_26964496.mp3
|
516 |
+
common_voice_eo_26964510.mp3
|
517 |
+
common_voice_eo_25432789.mp3
|
518 |
+
common_voice_eo_26688158.mp3
|
519 |
+
common_voice_eo_28516354.mp3
|
520 |
+
common_voice_eo_24790865.mp3
|
521 |
+
common_voice_eo_24790897.mp3
|
522 |
+
common_voice_eo_24790898.mp3
|
523 |
+
common_voice_eo_24790899.mp3
|
524 |
+
common_voice_eo_24790900.mp3
|
525 |
+
common_voice_eo_25362713.mp3
|
526 |
+
common_voice_eo_27585084.mp3
|
527 |
+
common_voice_eo_24813131.mp3
|
528 |
+
common_voice_eo_25035262.mp3
|
529 |
+
common_voice_eo_26000289.mp3
|
530 |
+
common_voice_eo_26003943.mp3
|
531 |
+
common_voice_eo_26283983.mp3
|
532 |
+
common_voice_eo_28708931.mp3
|
533 |
+
common_voice_eo_28037217.mp3
|
534 |
+
common_voice_eo_29273106.mp3
|
535 |
+
common_voice_eo_26006657.mp3
|
536 |
+
common_voice_eo_25399924.mp3
|
537 |
+
common_voice_eo_27982431.mp3
|
538 |
+
common_voice_eo_25893779.mp3
|
539 |
+
common_voice_eo_27842061.mp3
|
540 |
+
common_voice_eo_25052385.mp3
|
541 |
+
common_voice_eo_25807395.mp3
|
542 |
+
common_voice_eo_25807985.mp3
|
543 |
+
common_voice_eo_25808039.mp3
|
544 |
+
common_voice_eo_25808407.mp3
|
545 |
+
common_voice_eo_25809036.mp3
|
546 |
+
common_voice_eo_27487795.mp3
|
547 |
+
common_voice_eo_28460556.mp3
|
548 |
+
common_voice_eo_28884851.mp3
|
549 |
+
common_voice_eo_24819719.mp3
|
550 |
+
common_voice_eo_25153594.mp3
|
551 |
+
common_voice_eo_25234585.mp3
|
552 |
+
common_voice_eo_25245164.mp3
|
553 |
+
common_voice_eo_27538877.mp3
|
554 |
+
common_voice_eo_24862771.mp3
|
555 |
+
common_voice_eo_25070167.mp3
|
556 |
+
common_voice_eo_26381720.mp3
|
557 |
+
common_voice_eo_28110376.mp3
|
558 |
+
```
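The CER filter described under Alternative 3 can be sketched in plain Python. Here CER is taken to be the character-level Levenshtein distance divided by the length of the reference, which is the usual definition. The 0.5 cutoff comes from the text above, and the function names are my own.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(ref, hyp):
    """Character error rate of hyp against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def keep_low_cer(pairs, max_cer=0.5):
    """Keep (reference, prediction) pairs whose CER is at most max_cer."""
    return [(ref, hyp) for ref, hyp in pairs if cer(ref, hyp) <= max_cer]


print(round(cer("poste sekvas difinoj de la termino", "po"), 2))  # 0.94
```

The printed value agrees with the 0.94 reported for example 7333 in the table at the top of this discussion.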
|
559 |
+
|
560 |
+
### Alternative 3.1
|
561 |
+
|
562 |
+
For the files that have no audio or distorted audio, maybe change their target to the empty string? The exception is the 'injabum' recording, which does contain speech, just not the prompted sentence.
|
563 |
+
|
564 |
+
### And also
|
565 |
+
|
566 |
+
Since one can sign up at Common Voice to review Esperanto audio files, I've done so in the hope of making a small contribution to quality.
|