Simsamu transcription model

This repository contains a pretrained SpeechBrain transcription model for French that was fine-tuned on the Simsamu dataset.

The model is a CTC-based model on top of wav2vec2 embeddings, trained on data from the CommonVoice, PxCorpus and Simsamu datasets. The CTC layers were trained from scratch and the wav2vec2 layers were fine-tuned.
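
For readers unfamiliar with CTC, here is a minimal sketch of how a CTC output can be turned into text with greedy decoding: consecutive repeated labels are merged, then the blank symbol is dropped. This only illustrates the general technique and is not the decoding code used by SpeechBrain or by this model. The function name `ctc_greedy_decode`, the blank symbol `_` and the per-frame labels are all made up for the example.

```python
from itertools import groupby

# Hypothetical blank symbol, chosen for this illustration only.
BLANK = "_"


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blank symbols."""
    # groupby yields one key per run of identical consecutive labels
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


# Made-up per-frame best labels, one character per audio frame
frames = list("bb_oo_nn__jj_oo_uu_r")
print(ctc_greedy_decode(frames))
```

The blank symbol is what lets CTC emit doubled letters: a blank between two identical labels keeps them from being merged into one.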

The model can be used in medkit as follows:

from medkit.core.audio import AudioDocument
from medkit.audio.segmentation.pa_speaker_detector import PASpeakerDetector
from medkit.audio.transcription.sb_transcriber import SBTranscriber

# init speaker detector operation
speaker_detector = PASpeakerDetector(
    model="medkit/simsamu-diarization",
    device=0,
    segmentation_batch_size=10,
    embedding_batch_size=10,
)

# init transcriber operation
transcriber = SBTranscriber(
    model="medkit/simsamu-transcription",
    needs_decoder=False,
    output_label="transcription",
    device=0,
    batch_size=10,
)

# create audio document
audio_doc = AudioDocument.from_file("path/to/audio.wav")

# apply speaker detector operation on audio document
# to get speech segments
speech_segments = speaker_detector.run([audio_doc.raw_segment])

# apply transcriber operation on speech segments
transcriber.run(speech_segments)

# display transcription for each speech turn
for speech_seg in speech_segments:
    transcription_attr = speech_seg.attrs.get(label="transcription")[0]
    print(speech_seg.span.start, speech_seg.span.end, transcription_attr.value)

More info at https://medkit.readthedocs.io/

See also: Simsamu diarization pipeline
