LaBSE-Malach-Multilabel

A multilabel text classification model fine-tuned on an English subset (Malach ASR) of the Visual History Archive (VHA). It is based on LaBSE pretrained weights, but it uses the general Hugging Face framework rather than sentence-transformers. Input text segments averaged ~350 words.

Given an input string, the model predicts probabilities for 1063 keyword IDs from the VHA ontology, sorted by probability. To encode the predictions as a binary vector, probabilities >= 0.5 are typically treated as "True".
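
The ranking and thresholding described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the repository's own inference code. It assumes the model emits one raw logit per keyword and that probabilities come from an independent sigmoid per label, which is common for multilabel classifiers but is not confirmed here. The function name `rank_and_binarize` and the example logits are hypothetical, and the indices are positions in the output vector, not actual VHA keyword IDs.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rank_and_binarize(logits, threshold=0.5):
    """Convert per-label logits to ranked probabilities and a binary vector.

    Assumption: one logit per label, independent sigmoid per label.
    Returns (ranked, binary), where `ranked` is a list of
    (label_index, probability) pairs sorted by descending probability and
    `binary` keeps the original label order.
    """
    probs = [sigmoid(v) for v in logits]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    binary = [1 if p >= threshold else 0 for p in probs]
    return ranked, binary


# Hypothetical logits for four labels (the real model has 1063 outputs).
example_logits = [2.0, -1.0, 0.0, -3.0]
ranked, binary = rank_and_binarize(example_logits)

for label_index, prob in ranked:
    print(f"label {label_index}: {prob:.3f}")
print(binary)
```

The 0.5 cutoff is exposed as a parameter because it is only the typical choice; a different threshold may suit applications that favor precision or recall.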

The mapping from keyword IDs to labels will be added to the repository.