Supplemental reading and resources
This Unit pieced together many components from previous units, introducing the tasks of speech-to-speech translation, voice assistants and speaker diarization. The supplemental reading material is thus split into these three new tasks for your convenience:
Speech-to-speech translation:
- STST with discrete units by Meta AI: a direct approach to STST using encoder-decoder models (a cascaded counterpart is sketched after this list for contrast)
- Hokkien direct speech-to-speech translation by Meta AI: a direct approach to STST using encoder-decoder models with a two-stage decoder
- Leveraging unsupervised and weakly-supervised data to improve direct STST by Google: proposes new approaches for leveraging unsupervised and weakly supervised data to train direct STST models, along with a small change to the Transformer architecture
- Translatotron-2 by Google: a system that is able to retain speaker characteristics in translated speech
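The papers above all study direct STST. For contrast, a cascaded system (as built in this unit) first translates the source speech into target-language text and then synthesizes that text with a TTS model. The snippet below is a minimal sketch of the first stage only, using Whisper through the 🤗 Transformers pipeline; the checkpoint and audio file name are illustrative choices, not taken from the papers:

```python
from transformers import pipeline

# Stage 1 of a cascaded STST system: translate source speech into English text.
# "openai/whisper-base" and "speech_in_french.wav" are illustrative placeholders.
speech_translator = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base"
)

# Whisper performs speech translation (supported source language -> English text)
# when generation is run with task="translate".
translation = speech_translator(
    "speech_in_french.wav", generate_kwargs={"task": "translate"}
)

print(translation["text"])  # English text, ready to be passed to a TTS model
```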
Voice Assistant:
- Accurate wakeword detection by Amazon: a low-latency approach to wakeword detection for on-device applications (a rough classification-based sketch follows this list)
- RNN-Transducer Architecture by Google: a modification to the CTC architecture for streaming on-device ASR
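The unit's voice assistant framed wake word detection as audio classification over short chunks of microphone audio. As a rough sketch of that idea (not the low-latency systems from the papers above), the snippet below runs a keyword-spotting classifier with 🤗 Transformers; the checkpoint, wake word and confidence threshold are illustrative assumptions:

```python
from transformers import pipeline

# Keyword-spotting classifier; the checkpoint is an assumed example.
classifier = pipeline(
    "audio-classification", model="MIT/ast-finetuned-speech-commands-v2"
)

# Classify a short chunk of microphone audio (illustrative file name).
predictions = classifier("candidate_chunk.wav")
top = predictions[0]  # predictions are sorted by score, highest first

# Trigger only when the top class matches the chosen wake word with high confidence.
if top["label"] == "marvin" and top["score"] > 0.8:
    print("Wake word detected, start transcribing the spoken query")
```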
Meeting Transcriptions:
- pyannote.audio Technical Report by Hervé Bredin: this report describes the main principles behind the pyannote.audio speaker diarization pipeline (a minimal usage sketch follows this list)
- WhisperX by Max Bain et al.: a superior approach to computing word-level timestamps using the Whisper model