license: cc-by-3.0

Base model: facebook/opt-2.7b

Fine-tuned for causal language modeling of transcribed spoken dialogue from the TalkBank CABank collection. Training corpora include:

(Corpus descriptions are from TalkBank)

Data input format: The input models a sequence of spoken dialogue between two or more participants:

  • The sequence is prefixed with information about the participants, including name (can be a proper noun, a title/role, or unknown), age (can be a number or unknown), and sex (can be male, female, other, or unknown).
  • It then lists all utterances in the conversation in order, each prefixed with its speaker's participant code (S1, S2, S3, etc.).
  • Utterances support a limited set of transcription notations in the CHAT & CHAT-CA formats:
    • Pauses: (.) for a generic short pause, or (N.N) for a timed pause. For example (3.4) is a pause for 3.4 seconds.
    • Non-verbal sounds: &=laughs, &=cough, &=breathes, &=click, etc. Anything describing a speaker-produced non-verbal sound can follow the &= prefix.
    • Comments about speaker or setting: [% baby crying in background], [% smiling], [% phone clicking noise], [% imitating him], etc. Anything describing the state of the speaker or environment can be in this block. Also, a comment block can be used to describe speaker-produced sounds, but it is more common to use the &= prefix for that.
    • Unknown or unintelligible utterances: xxx
    • Breathing: hhh
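
The notations above are regular enough to pick out with a few regular expressions. The following is a minimal sketch, not a full CHAT parser: the pattern names and the `extract_annotations` helper are hypothetical and cover only the notations listed in this card.

```python
import re

# Simplified patterns for the notations listed above (assumed, not a full CHAT grammar).
PAUSE = re.compile(r"\((\.|\d+\.\d+)\)")    # (.) or (N.N)
SOUND = re.compile(r"&=(\S+)")              # &=laughs, &=cough, ...
COMMENT = re.compile(r"\[%\s*([^\]]*)\]")   # [% free text]


def extract_annotations(utterance: str) -> dict:
    """Collect pauses, non-verbal sounds, comments, and marker counts from one utterance."""
    # A generic short pause "(.)" has no duration, so it is recorded as None.
    pauses = [None if p == "." else float(p) for p in PAUSE.findall(utterance)]
    return {
        "pauses": pauses,
        "sounds": SOUND.findall(utterance),
        "comments": [c.strip() for c in COMMENT.findall(utterance)],
        "unintelligible": len(re.findall(r"\bxxx\b", utterance)),
        "breaths": len(re.findall(r"\bhhh\b", utterance)),
    }


example = "hhh hhh [% background noise] uh yeah (0.8) I can hear you. (1.2) &=cough can you hear me?"
info = extract_annotations(example)
print(info)
```

A helper like this can be useful for post-processing generated text, for example turning timed pauses into delays before speech synthesis.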

Example:

<participant> S1 (name: Dave, age: 33, sex: male) <participant> S2 (name: unknown, age: unknown, sex: unknown) <dialog> S1: Hi! (2.3) are you there? S2: hhh hhh [% background noise] uh yeah (0.8) I can hear you. (1.2) &=cough can you hear me? S1: ...
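
A prompt in this shape can be assembled programmatically. The sketch below is an illustration only: the `Participant` class and `build_prompt` function are hypothetical names, and joining the segments with single spaces is an assumption based on how the example above is rendered.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Participant:
    code: str                            # e.g. "S1"
    name: str = "unknown"
    age: Union[int, str] = "unknown"
    sex: str = "unknown"


def build_prompt(participants: List[Participant],
                 utterances: List[Tuple[str, str]]) -> str:
    """Assemble the participant header followed by the speaker-coded utterances."""
    parts = [
        f"<participant> {p.code} (name: {p.name}, age: {p.age}, sex: {p.sex})"
        for p in participants
    ]
    parts.append("<dialog>")
    parts.extend(f"{code}: {text}" for code, text in utterances)
    # Assumption: segments are separated by single spaces, as in the example above.
    return " ".join(parts)


prompt = build_prompt(
    [Participant("S1", name="Dave", age=33, sex="male"), Participant("S2")],
    [("S1", "Hi! (2.3) are you there?")],
)
print(prompt)
```

Appending a bare speaker code such as `S2:` to the end of the prompt is one way to ask the model to continue as that participant.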

Usage Info:

Per the OPT documentation, the model was trained with tokenizer setting use_fast=False.
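
As a rough sketch, loading with the Hugging Face `transformers` library might look like the following. The helper names, the `model_id` argument (this card does not state the repository id, so it must be supplied by the caller), and the generation settings are all assumptions, not values taken from this card.

```python
def load_model_and_tokenizer(model_id: str):
    """Load a causal LM and its slow tokenizer from the Hugging Face Hub."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # use_fast=False selects the slow (Python) tokenizer, matching the setting above.
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return model, tokenizer


def continue_dialog(model, tokenizer, prompt: str, max_new_tokens: int = 30) -> str:
    """Sample a short continuation of a dialogue prompt (settings are illustrative)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens)
```

Calling `load_model_and_tokenizer` downloads the weights, so it needs network access and enough memory for a 2.7B-parameter model.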

To use this model for real-time inference in a continuous duplex dialogue system, see: https://github.com/AbrahamSanders/realtime-chatbot.