Pierre-Carl Langlais

Pclanglais

Dorialexander

AI & ML interests

Open data & open LLMs

Recent Activity

liked a dataset about 9 hours ago

datalab-to/marker_comparison_mistral_llm

updated a dataset about 15 hours ago

PleIAs/Medical-Commons

published a dataset about 16 hours ago

PleIAs/Medical-Commons

Posts 6

Post

2967

Today we release our first foundation model and experiment with a new category: specialized pre-training.

OCRonos-Vintage is a 124M-parameter model trained end-to-end by Pleias with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves nearly state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…

We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…

OCRonos-Vintage: PleIAs/OCRonos-Vintage
CPU Demo: PleIAs/OCRonos-Vintage-CPU
GPU Demo: PleIAs/OCRonos-Vintage-GPU
Our announcement and call for specialized pre-training: https://huggingface.co./blog/Pclanglais/specialized-pre-training

Articles 7

Article

82

They Said It Couldn’t Be Done

Papers 1

arxiv:2501.08365

Spaces 9

Reversed Zotero

Editorialization

Correction-OCR

Tchap

Motta

tag_theme

Models 38

Pclanglais/Popeye-1929

Text-to-Image • Updated Dec 31, 2024 • 27

Pclanglais/Pleias-Nano-onnx

Text Generation • Updated Dec 9, 2024 • 29

Pclanglais/Pleias-Pico-onnx

Updated Dec 9, 2024 • 24

Pclanglais/Headlines-OCR-Correction

Updated Oct 25, 2024 • 14

Pclanglais/SynthRag3

Updated Sep 11, 2024 • 14

Pclanglais/SynthRag2

Updated Sep 9, 2024 • 10

Pclanglais/SynthRag1

Updated Sep 8, 2024 • 8

Pclanglais/Experiment1

Updated Sep 5, 2024 • 13

Pclanglais/Segmentext-Marianne

Updated Aug 28, 2024 • 6

Pclanglais/OCRonos-Vintage-GGUF

Updated Aug 11, 2024

Datasets 12

Pclanglais/course-material

Viewer • Updated 3 days ago • 84.3k • 2.69k

Pclanglais/tokenized_sample

Viewer • Updated 25 days ago • 1.54M • 1.3k

Pclanglais/pdf_sample_10k

Viewer • Updated Nov 30, 2024 • 415k • 27 • 1

Pclanglais/open-science

Viewer • Updated Nov 15, 2024 • 10.8M • 251

Pclanglais/LLM-for-DH

Viewer • Updated Jul 14, 2024 • 1.62k • 23

Pclanglais/youtube-commons-metadata

Viewer • Updated Jun 19, 2024 • 6.91M • 44

Pclanglais/OCR-test

Viewer • Updated Apr 22, 2024 • 20.1k • 39 • 1

Pclanglais/AllWikidataCharacters

Viewer • Updated Apr 14, 2024 • 180k • 86 • 7

Pclanglais/wiki-dataset

Viewer • Updated Jan 4, 2024 • 282 • 197

Pclanglais/Mickey-1928-dataset

Viewer • Updated Dec 31, 2023 • 96 • 209 • 7