Is this the best tool to extract clean info from PDFs, handwriting and complex documents yet?
Open source olmOCR just dropped and the results are impressive.
Tested the free demo with various documents, including a handwritten Claes Oldenburg letter. The speed is impressive: 3,000 tokens/second on your own GPU, which works out to about $190 per million pages, roughly 1/32 the cost of GPT-4o. Game-changer for content extraction and digital archives.
To achieve this, Ai2 trained a 7B vision language model on 260K pages from 100K PDFs using "document anchoring" - combining PDF metadata with page images.
Best part: it actually understands document structure (columns, tables, equations) instead of just jumbling everything together like most OCR tools. Their human eval results back this up.
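For a rough sense of what those cost numbers mean, here is a back-of-the-envelope sketch. It assumes the $190 figure is olmOCR's own cost per million pages and that "1/32" is the ratio of that cost to GPT-4o's. The function names are made up for illustration, and the GPT-4o figure is only what the ratio implies, not a quoted price.

```python
# Assumption: $190 is olmOCR's cost per million pages, and 1/32 is the
# ratio of that cost to GPT-4o's cost for the same job.
OLMOCR_USD_PER_MILLION_PAGES = 190.0
COST_RATIO = 1 / 32


def implied_baseline_cost(own_cost: float, ratio: float) -> float:
    """Return the baseline cost implied by own_cost = ratio * baseline."""
    return own_cost / ratio


def cost_for_pages(usd_per_million: float, pages: int) -> float:
    """Scale a per-million-pages price to an arbitrary page count."""
    return usd_per_million * pages / 1_000_000


gpt4o_usd_per_million = implied_baseline_cost(OLMOCR_USD_PER_MILLION_PAGES, COST_RATIO)
print(f"Implied GPT-4o cost: ${gpt4o_usd_per_million:,.0f} per million pages")
print(f"olmOCR, 50,000 pages: ${cost_for_pages(OLMOCR_USD_PER_MILLION_PAGES, 50_000):.2f}")
```

For a small archive of 50,000 pages, the same per-page rate puts the bill in the single-digit dollar range, which is the point of the comparison.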
Getting WebRTC and WebSockets right in Python is very tricky. If you've tried to wrap an LLM in a real-time audio layer, then you know what I'm talking about.
That's where FastRTC comes in! It makes WebRTC and WebSocket streams super easy with minimal code and overhead.
Just launched: A toolkit of 20 powerful AI tools that journalists can use right now - transcribe, analyze, create. 100% free & open-source.
Been testing all these tools myself and created a searchable collection of the most practical ones - from audio transcription to image generation to document analysis. No coding needed, no expensive subscriptions.
Some highlights I've tested personally:
- Private, on-device transcription with speaker ID in 100+ languages using Whisper
- Website scraping that just works - paste a URL, get structured data
- Local image editing with tools like Finegrain (impressive results)
- Document chat using Qwen 2.5 72B (handles technical papers well)
Sharing this early because the best tools come from the community. Drop your favorite tools in the comments or join the discussion on what to add next!
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones
Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.
To make a research survey, you generally follow two steps: preparation (collect and organize papers) and writing (outline creation, writing, polishing). The researchers followed the same two steps and automated them.
For the preparation part, a key step is finding all the important references on the given subject. Researchers first cast a wide net over all relevant papers. But then finding the really important ones is like finding needles in a haystack of information. To solve this challenge, they built an "AttributeTree" object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!
For the writing part, the key was to get a synthesis that's both short and true. This is not easy to get with LLMs! So they used methods like LLM-based deduplication to shorten the overly verbose listings made by LLMs, and RAG to grab original quotes instead of made-up ones.
As a result, their system outperforms previous approaches by far!
As assessed by LLM judges, the quality score of SurveyX even approaches that of human experts, with 4.59/5 vs 4.75/5
Trying something new to keep you ahead of the curve: The 5 AI stories of the week - a weekly curation of the most important AI news you need to know. Do you like it?
Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait, ALLaM is finally here, and it is IMPRESSIVE given its size!
Google just released PaliGemma 2 Mix: new versatile instruction vision language models
> Three new models: 3B, 10B, 28B at resolutions 224 and 448
> Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything
Perplexity drops their FIRST open-weight model on Hugging Face: A decensored DeepSeek-R1 with full reasoning capabilities. Tested on 1000+ examples for unbiased responses.
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning!
Do we really need o1's huge RL procedure to see reasoning emerge? It seems not. Researchers from Shanghai Jiao Tong University just demonstrated that carefully selected examples can boost math performance in large language models using SFT, with no huge datasets or RL procedures needed.
Their procedure allows Qwen2.5-32B-Instruct to jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data in previous approaches.
The Less-is-More Reasoning Hypothesis:
- Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
- Pre-training knowledge plus sufficient computational resources at inference time levels up math skills
Core techniques:
- High-quality reasoning chains with self-verification steps
- 817 handpicked problems that encourage deeper reasoning
- Enough inference-time computation to allow extended reasoning
Efficiency gains:
- Only 817 examples instead of 100k+
- 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data
This really challenges the notion that SFT leads to memorization rather than generalization! And it opens up reasoning research to GPU-poor researchers.
Will we soon all have our own personalized AI news agents? And what does it mean for journalism?
Just built a simple prototype based on the Hugging Face course. It lets you get customized news updates on any topic.
Not perfect yet, but you can see where things could go: we'll all be able to build personalized AI agents that curate & analyze news for each of us. Users could also decide to build custom news products for their own needs, such as truly personalized newsletters or podcasts.
The implications for both readers & news organizations are significant. To name a few:
- Will news articles remain the best format for informing people?
- What monetization model will work for news organizations?
- How do you create an effective conversion funnel?
I am pleased to introduce my first project built upon Hugging Face's smolagents framework, integrated with Alpaca for financial market analysis automation
The project implements technical indicators such as the Relative Strength Index (RSI) and Bollinger Bands to provide momentum and volatility analysis. Market data is retrieved through the Alpaca API, enabling access to historical price information across various timeframes.
AI-powered insights are generated using Hugging Face's inference API, facilitating the analysis of market trends through natural language processing with DuckDuckGo search integration for real-time sentiment analysis based on financial news
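For readers unfamiliar with the two indicators mentioned above, here is a minimal standard-library sketch of how they are commonly computed. It is not the project's code: the function names are made up, the RSI uses plain averages rather than Wilder's smoothing, and the Bollinger Bands use the population standard deviation. Those are all assumptions chosen to keep the example short.

```python
from statistics import mean, pstdev


def rsi(closes: list[float], period: int = 14) -> float:
    """Simple-average RSI over the last `period` price changes (0 to 100)."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    window = closes[-(period + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        # No losing step in the window; flat prices also land here.
        return 100.0
    relative_strength = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + relative_strength)


def bollinger_bands(
    closes: list[float], period: int = 20, num_std: float = 2.0
) -> tuple[float, float, float]:
    """Return (lower, middle, upper) bands from the last `period` closes."""
    if len(closes) < period:
        raise ValueError("need at least `period` closing prices")
    window = closes[-period:]
    middle = mean(window)
    spread = num_std * pstdev(window)
    return middle - spread, middle, middle + spread


# Made-up closing prices purely for illustration, not real market data.
sample_closes = [100 + (i % 5) - 0.1 * i for i in range(30)]
print(f"RSI(14): {rsi(sample_closes):.1f}")
lower, middle, upper = bollinger_bands(sample_closes)
print(f"Bollinger(20, 2): {lower:.2f} / {middle:.2f} / {upper:.2f}")
```

RSI captures momentum (how much of the recent movement was upward), while the band width tracks volatility. In a real setup the `closes` list would come from the market data API instead of a synthetic series.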
Multimodal
> OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context
> AIDC released Ovis2 model family along with Ovis dataset, new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B), with video and OCR support
> ColQwenStella-2b is a multilingual visual retrieval model that is SOTA for its size
> Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning, long context video understanding
LLMs
A lot of math models!
> Open-R1 team released OpenR1-Math-220k large scale math reasoning dataset, along with Qwen2.5-220K-Math fine-tuned on the dataset, OpenR1-Qwen-7B
> Nomic AI released new Nomic Embed multilingual retrieval model, a MoE with 500M params with 305M active params, outperforming other models
> DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math
> LIMO is a new fine-tune of Qwen2.5-32B-Instruct on Math
Audio
> Zonos-v0.1 is a new family of text-to-speech models, which contains the model itself and embeddings
Vision and Image Generation
> We have ported DepthPro of Apple to transformers for your convenience!
> illustrious-xl-v1.0 is a new illustration generation model
Great feature alert: you can now share agents to the Hub!
And any agent pushed to the Hub gets a cool Space interface to directly chat with it.
This was a real technical challenge: for instance, serializing tools to export them meant that you needed to get all the source code for a tool, verify that it was standalone (not relying on external variables), and gather all the packages required to make it run.
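To give a feel for what "verify that it was standalone" involves, here is a rough sketch using Python's standard `ast` module. It is not the library's actual implementation: `undefined_names` and `imported_packages` are hypothetical helpers, and the check ignores the order in which names are defined, so it is only an approximation.

```python
import ast
import builtins


def undefined_names(func_source: str) -> set[str]:
    """Return names a function reads but never defines, imports, or receives."""
    func = ast.parse(func_source).body[0]
    if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError("expected source that starts with a function definition")

    defined = {func.name}
    loaded = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
            else:
                defined.add(node.id)  # assignment, loop, or comprehension target
        elif isinstance(node, ast.arg):
            defined.add(node.arg)  # parameters, including nested functions
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            defined.add(node.name)
    return loaded - defined - set(dir(builtins))


def imported_packages(func_source: str) -> set[str]:
    """Return top-level package names imported anywhere in the source."""
    packages = set()
    for node in ast.walk(ast.parse(func_source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages


standalone = "def area(r):\n    import math\n    return math.pi * r ** 2\n"
leaky = "def scale(x):\n    return x * FACTOR\n"
print(undefined_names(standalone), imported_packages(standalone))
print(undefined_names(leaky))
```

A tool whose source yields an empty set from `undefined_names` can be exported as-is, and the import scan gives a first guess at the requirements list. The `leaky` example would be rejected because `FACTOR` lives outside the function.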
The AI Energy Score project just launched - this is a game-changer for making informed decisions about AI deployment.
You can now see exactly how much energy your chosen model will consume, with a simple 5-star rating system. Think appliance energy labels, but for AI.
Looking at transcription models on the leaderboard is fascinating: choosing between whisper-tiny and whisper-large-v3 can make a 7x difference in energy use. Real-time data on these tradeoffs changes everything.
166 models already evaluated across 10 different tasks, from text generation to image classification. The whole thing is public and you can submit your own models to test.
Why this matters:
- Teams can pick efficient models that still get the job done
- Developers can optimize for energy use from day one
- Organizations can finally predict their AI environmental impact
If you're building with AI at any scale, definitely worth checking out.
"2025 will be the year of AI agents": this statement has often been made, here are numbers to support it.
I've plotted the progress of AI agents on the GAIA test set, and it seems they're headed to catch up with the human baseline in early 2026.
And that progress is still driven mostly by the improvement of base LLMs: progress would be even faster with fine-tuned agentic models.
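For anyone who wants to reproduce this kind of projection, here is a minimal sketch of a straight-line extrapolation. It assumes a linear trend, which may not match how the plot above was made, and the numbers in the example are synthetic placeholders, not GAIA scores. `fit_line` and `crossing_point` are names made up for illustration.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points, with matching lengths")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_point(xs: list[float], ys: list[float], target: float):
    """Return the x where the fitted line reaches `target`, or None if it never rises."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    return (target - intercept) / slope


# Synthetic placeholder data: months since a first measurement vs. score (%).
months = [0.0, 6.0, 12.0, 18.0]
scores = [20.0, 35.0, 50.0, 65.0]
baseline = 90.0  # placeholder for a human baseline, not the real figure
print(f"Projected crossing after {crossing_point(months, scores, baseline):.1f} months")
```

A linear fit is the simplest possible choice. If scores saturate as they approach the baseline, the real crossing would come later than this kind of projection suggests.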