Excited to share Monkt - a tool I built to solve the eternal headache of processing documents for ML/AI pipelines.
What it does: Converts PDF, Word, PowerPoint, and Excel files, web pages, or raw HTML into clean Markdown or structured JSON.
Great for: LLM training dataset preparation; knowledge base construction; research paper processing; technical documentation management.
It has API access for integration into ML pipelines.
Check it out at https://monkt.com/ if you want to save time on document processing infrastructure.
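To give a feel for what "raw HTML into clean Markdown" means in practice, here is a minimal sketch using only Python's standard library. It is not Monkt's implementation or API; the class `MarkdownSketch` and the function `html_to_markdown` are hypothetical names made up for this illustration, and the sketch only handles headings, paragraphs, and list items.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter (illustrative only, not Monkt's code)."""

    # Markdown prefixes for the heading levels this sketch understands.
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished Markdown blocks
        self.prefix = None  # prefix of the block being read, or None outside a block
        self.buffer = []    # text fragments collected for the current block

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag]
        elif tag == "p":
            self.prefix = ""
        elif tag == "li":
            self.prefix = "- "
        else:
            return  # inline or unknown tag: keep collecting text as-is
        self.buffer = []

    def handle_endtag(self, tag):
        is_block = tag in self.HEADINGS or tag in ("p", "li")
        if self.prefix is not None and is_block:
            # Collapse runs of whitespace so the output is "clean".
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        # Text outside a known block (scripts, navigation, etc.) is dropped.
        if self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    """Convert a small HTML fragment to Markdown, one block per paragraph."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>Hello <b>world</b>.</p><ul><li>one</li><li>two</li></ul>"
    print(html_to_markdown(sample))
```

Real documents need far more than this (tables, links, nested lists, PDF layout recovery), which is the infrastructure work a dedicated tool is meant to take off your hands.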
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and are working on a blog post walking you through the entire dataset creation process. Stay tuned!