Document OCR quality #6
opened by us3r
Are there any best practices for getting consistent results with document OCR? My attempts have been hit or miss so far.
Bad result
Prompt: Please ocr this document
Answer:
Niklas Mommigholf1,* 2 Zhong Yang1,* 3 Wiqing Shi1,* 4 Ning Lin1,* 5 Li Fu-Fei1, Hamannch Hajabirai1
Luke Zettlemeyer1* 6 Percy Liang 7 Emmanuel Candola 8 Hamannch Hashimoto1
Abstract
Figure 1: Test-time scaling with α =32K. The benchmark is α =32B
Second, we show that the model's thinking process and
of this approach is to increase the compute time to
can be easily parallelized by recursively terminating the model’s think-
ing better. There has been much work exploring the
multiple layers of the model’s perception when a
128-bit integer test check is answer, often fixing incorrect reasoning
and the impact of this on performance. We have demonstrated that our
to the problem of solving a 32K-instance language model at α =32B
approach can be used in practice for many real-world
problems. For example, we describe their approach using large
scaling (4,32K) with budget forcing allows varying
experiments to reflect their models relying on hamburgers and
the number of data points. OpenAI describes their approach by using
2624q, making agreements (Gil et al., 2024), and also
a simple approach (Gil et al., 2024). Our
from [97] to [378] of replicating and testing these approaches. Deeplock [8] (Dropbox.co.uk, et al., 2027)
proves that training only 1,000 samples with arbitrary
provides a simple approach to achieve both time scaling and
parameters is sufficient for a language model at α =32K to
the past years have largely ruled out scaling up train-
predicting and controlling thinking duration via a simple
ing a 32B-instance language model at α =32B. The creators of
stream memorizing model that scales in performance with
plan et al., 2026, Hotham et al., 2021. The creators of
which consists of 1,000 carefully curated questions paired
W3N-NN and Zy et al. started the project. WS-NN
with a 1K subset and 2B and
Thinking Experiment (Google, 2024). We perform supe-
plexes, LZ and WS proposed using a 1K subset and 2B and
rvisions of this approach to achieve both time scaling and
Washington, Seattle. Also Institute for AI - Conceptual AI. </doc>
Better result with manual cropping of the abstract, but it still struggles with the l/1 distinction:
Test-time scaling is a promising new approach to language modeling that uses extra test-time computation to improve performance. Recently, OpenAI's publicly shared model showed this capability but did not publish its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset sIK of 1,000 questions paired with reasoning traces relating on three criteria: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time computation by lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to doublecheck its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.532B-Instruct language model on sIK and equipping it with budget forcing, our model sl-32B excels on competition math questions by up to 27% (MATH and AIME24). Further, scaling sl-32B with budget forcing allows extrapolation beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at
https://github.com/simplescaling/sl
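For reference, the cropping step I used before the second attempt can be sketched with Pillow. The page dimensions and crop box below are made-up placeholders, not the actual values I used:

```python
from PIL import Image

# Stand-in for a scanned page; in practice this would be Image.open("page.png").
page = Image.new("RGB", (1700, 2200), "white")

# Hypothetical crop box (left, top, right, bottom) covering just the abstract.
# Sending only this region to the OCR model avoided most of the garbling
# from figures and multi-column layout.
abstract_box = (150, 600, 1550, 1100)
abstract = page.crop(abstract_box)

print(abstract.size)  # (1400, 500)
```

Cropping to a single text region seems to help because the model no longer interleaves figure captions and column fragments, though it did not fix character-level confusions like l vs. 1.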