Today we release our first foundation model and experiment with a new category: specialized pre-training.
OCRonos-Vintage is a 124M-parameter model trained end-to-end by Pleias with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves nearly state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…
We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…
Since it is release season, at PleIAs we are announcing our first suite of specialized language models for document processing tasks (OCR correction, text segmentation, bibliographic extraction) and releasing the largest multimodal dataset of financial documents, Finance Commons: https://huggingface.co./blog/Pclanglais/finance-commons-bad-data-toolbox
LLM research is currently focused on quality data. We went in the opposite direction and deliberately trained models on bad data. Far from degrading the models, this made them more resilient to the text sources commonly used in production.
Having a wider range of real-life data proved critical for this project. A few months after the release of Common Corpus, we expanded our pool of "training data commons" with a major multimodal resource: documents released as open financial data. Finance Commons comprises 17 billion tokens and 1.25 PDF corporate documents released by the SEC, WTO, AMF, and EU Tenders, in multiple languages, with a large variety of document layouts and challenging sources to train more robust models.
With HuggingFace compute support, we release an entire pipeline to process bad data sources and make them usable in production for LLMOps or simply retrieval: PleIAs/PleIAs-Editor
This approach is based on our new series of specialized models for document processing, the "bad data toolbox", comprising:
* OCRonos, the best available model to date for OCR correction. PleIAs/OCRonos
* Segmentext, a pure semantic small model for text segmentation, working without any visual reference. PleIAs/Segmentext
* Bibtexer, a small model for bibliographic data extraction acting as a "reversed-Zotero." PleIAs/BibTexer
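
To make the data flow concrete, here is a minimal sketch of how three such stages could be chained on a noisy document. It is an illustration under stated assumptions, not the PleIAs-Editor implementation: the ordering (correction, then segmentation, then bibliographic extraction), the `process` function, and the three `placeholder_*` functions are all hypothetical. The placeholders use simple string and regex rules so the example runs without downloading any model; in practice each would be replaced by a call to the corresponding model.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProcessedDocument:
    """Container for the output of each (hypothetical) stage."""
    corrected_text: str
    segments: List[str] = field(default_factory=list)
    references: List[Dict[str, str]] = field(default_factory=list)


def placeholder_correct(text: str) -> str:
    # Stand-in for an OCR-correction model: fixes two hard-coded misreadings.
    return text.replace("tbe", "the").replace("wbich", "which")


def placeholder_segment(text: str) -> List[str]:
    # Stand-in for a segmentation model: splits on blank lines only.
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def placeholder_extract(segment: str) -> List[Dict[str, str]]:
    # Stand-in for bibliographic extraction: matches "Author (Year). Title."
    pattern = re.compile(
        r"(?P<author>[A-Z][\w.\- ]+?) \((?P<year>\d{4})\)\. (?P<title>[^.]+)\."
    )
    return [match.groupdict() for match in pattern.finditer(segment)]


def process(
    raw_text: str,
    correct: Callable[[str], str] = placeholder_correct,
    segment: Callable[[str], List[str]] = placeholder_segment,
    extract: Callable[[str], List[Dict[str, str]]] = placeholder_extract,
) -> ProcessedDocument:
    # Assumed ordering: clean the text first so later stages see fewer OCR errors.
    corrected = correct(raw_text)
    segments = segment(corrected)
    references = [ref for seg in segments for ref in extract(seg)]
    return ProcessedDocument(corrected, segments, references)
```

Because each stage is passed in as a plain callable, a placeholder can be swapped for a real model wrapper without touching the rest of the pipeline.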