Supplemental reading and resources
This Unit pieced together many components from previous units, introducing the tasks of speech-to-speech translation, voice assistants and speaker diarization. The supplemental reading material is thus split into these three new tasks for your convenience:
Speech-to-speech translation:
- STST with discrete units by Meta AI: a direct approach to STST using encoder-decoder models (a cascaded counterpart is sketched after this list for contrast)
- Hokkien direct speech-to-speech translation by Meta AI: a direct approach to STST using encoder-decoder models with a two-stage decoder
- Leveraging unsupervised and weakly-supervised data to improve direct STST by Google: proposes new approaches for leveraging unsupervised and weakly supervised data to train direct STST models, along with a small change to the Transformer architecture
- Translatotron-2 by Google: a system that is able to retain speaker characteristics in translated speech
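The papers above all study direct STST. For contrast, a cascaded system (as built in this unit) first translates the source speech into target-language text and then synthesizes that text with a TTS model. The snippet below is a minimal sketch of the first stage only, using Whisper through the 🤗 Transformers pipeline; the checkpoint and audio file name are illustrative choices, not taken from the papers:

```python
from transformers import pipeline

# Stage 1 of a cascaded STST system: translate source speech into English text.
# "openai/whisper-base" and "speech_in_french.wav" are illustrative placeholders.
speech_translator = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base"
)

# Whisper performs speech translation (supported source language -> English text)
# when generation is run with task="translate".
translation = speech_translator(
    "speech_in_french.wav", generate_kwargs={"task": "translate"}
)

print(translation["text"])  # English text, ready to be passed to a TTS model
```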
Voice Assistant:
- Accurate wakeword detection by Amazon: a low-latency approach to wakeword detection for on-device applications (a rough classification-based sketch follows this list)
- RNN-Transducer Architecture by Google: a modification to the CTC architecture for streaming on-device ASR
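The unit's voice assistant framed wake word detection as audio classification over short chunks of microphone audio. As a rough sketch of that idea (not the low-latency systems from the papers above), the snippet below runs a keyword-spotting classifier with 🤗 Transformers; the checkpoint, wake word and confidence threshold are illustrative assumptions:

```python
from transformers import pipeline

# Keyword-spotting classifier; the checkpoint is an assumed example.
classifier = pipeline(
    "audio-classification", model="MIT/ast-finetuned-speech-commands-v2"
)

# Classify a short chunk of microphone audio (illustrative file name).
predictions = classifier("candidate_chunk.wav")
top = predictions[0]  # predictions are sorted by score, highest first

# Trigger only when the top class matches the chosen wake word with high confidence.
if top["label"] == "marvin" and top["score"] > 0.8:
    print("Wake word detected, start transcribing the spoken query")
```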
Meeting Transcriptions:
- pyannote.audio Technical Report by Hervé Bredin: this report describes the main principles behind the pyannote.audio speaker diarization pipeline (a minimal usage sketch follows this list)
- WhisperX by Max Bain et al.: a superior approach to computing word-level timestamps using the Whisper model