[FEEDBACK] Daily Papers

Note that this is not a thread for submitting new papers; it is for feedback on the Daily Papers community update feature.
How to submit a paper to the Daily Papers, like @akhaliq (AK)?
- Submission is open to paper authors
- Only recent papers (less than 7 days old) can be featured in the Daily
- Then drop the arXiv ID into the form at https://huggingface.co./papers/submit
- Add media to the paper (images, videos) when relevant
- You can start the discussion to engage with the community
Please check out the documentation
We are excited to share our recent work on MLLM architecture design titled "Ovis: Structural Embedding Alignment for Multimodal Large Language Model".
Paper: https://arxiv.org/abs/2405.20797
Github: https://github.com/AIDC-AI/Ovis
Model: https://huggingface.co./AIDC-AI/Ovis-Clip-Llama3-8B
Data: https://huggingface.co./datasets/AIDC-AI/Ovis-dataset
@Yiwen-ntu for now we support only videos as paper covers in the Daily.
We are excited to share our work titled "Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models": https://arxiv.org/abs/2406.12644
Dear AK and HF Team,
We are thrilled to present our recent research, which investigates and benchmarks various inference-time computation strategies to enhance reasoning performance in large language models (LLMs). With the growing interest in solving complex reasoning tasks, methods such as Best-of-N and beam search have shown promise in improving reasoning capabilities without requiring modifications to model parameters or additional training. However, challenges remain in their implementation, with many existing approaches still in the proof-of-concept stage, hindered by computational complexity and task-specific limitations.
In this work, we focus on optimizing both the candidate solution generation and the reward mechanisms that underpin these inference-time strategies. By exploring the impact of different prompting techniques, hyperparameters such as temperature and top-p, and reward types such as self-evaluation and RLHF rewards, we uncover previously overlooked strategies that significantly enhance reasoning performance. Our extensive experiments, spanning over 20,000 A100-80G GPU hours and more than 1,000 runs, cover various models from the Llama, Qwen, and Mistral families. These findings demonstrate that careful tuning of hyperparameters such as temperature can lead to performance gains of up to 5% in reasoning tasks.
Furthermore, we establish a standardized benchmark for evaluating inference-time computation techniques, assessing six representative methods across eight different reasoning tasks. Our work provides a robust foundation for advancing future research in this area, setting the stage for more practical and scalable applications of LLM-based reasoning systems.
Title: Bag of Tricks for Inference-time Computation of LLM Reasoning
Link: https://arxiv.org/abs/2502.07191
Github: https://github.com/usail-hkust/benchmark_inference_time_computation_LLM
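As a rough illustration of the kind of inference-time strategy benchmarked in the paper, here is a minimal Best-of-N sketch: sample N candidates with a tuned temperature/top-p, score each candidate with a reward, and keep the best one. The model name, prompt, and the simple Yes/No self-evaluation reward below are placeholder assumptions, not the paper's exact setup.

```python
# Minimal Best-of-N sketch (illustrative; not the paper's benchmark code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # small stand-in model (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "If a train travels 60 km in 1.5 hours, what is its average speed?"
prompt = f"Question: {question}\nAnswer step by step:"
inputs = tok(prompt, return_tensors="pt")

# 1) Candidate generation: N samples with a tuned temperature / top-p,
#    the hyperparameters the paper finds can shift accuracy by several points.
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=8,              # N in Best-of-N
        max_new_tokens=128,
        pad_token_id=tok.eos_token_id,
    )
candidates = [
    tok.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    for seq in out
]

# 2) Reward: a crude self-evaluation score, P("Yes") when the model is asked
#    whether the candidate answer is correct.
def self_eval_reward(question: str, answer: str) -> float:
    eval_prompt = (
        f"Question: {question}\nProposed answer: {answer}\n"
        "Is the proposed answer correct? Reply Yes or No:"
    )
    ids = tok(eval_prompt, return_tensors="pt")
    with torch.no_grad():
        last_logits = model(**ids).logits[0, -1]
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return torch.softmax(last_logits[[yes_id, no_id]], dim=-1)[0].item()

# 3) Selection: keep the highest-reward candidate.
scores = [self_eval_reward(question, c) for c in candidates]
print(candidates[max(range(len(scores)), key=scores.__getitem__)])
```

Swapping in a stronger reward (e.g., a trained reward model) or re-tuning the sampling hyperparameters is exactly the kind of knob the benchmark sweeps over.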
Dear AK and HF Team,
We are excited to share our work on Text-to-SQL. The information for the paper we submitted is as follows:
Title: SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL
Link: https://arxiv.org/abs/2502.11741
Github: https://github.com/ShuaiLyu0110/SQL-o1
Dear AK and HF Team,
Buckle up for a wild ride into the world of large language models! Ever wished you could fine-tune massive LLMs without needing a full-blown data center? Well, dream no more! Our new approach, LoRAM, is here to train small and infer large, bringing you memory-efficient LoRA training without sacrificing performance.
Imagine turning a 70-billion-parameter beast into a nimble, memory-efficient marvel, like transforming an elephant into a sleek race car! We take the classic LoRA method, give it a trendy haircut by pruning away those underutilized neurons, and then recover the pruned low-rank matrices to supercharge the full model during inference.
The Challenge
While LoRA offers a cost-effective fine-tuning solution, the memory footprint remains dominated by the original model parameters. Training a 70B model traditionally demands an A100-80G GPU or even a fleet of 15 GPUs. Yikes!
The LoRAM Magic
LoRAM turns this challenge on its head by:
- Tiny Yet Mighty: Training on a pruned (small) model with just 20G of HBM, no need for heavyweight GPUs!
- Wallet-Friendly Wizardry: Structured pruning combined with 4-bit quantization (QLoRAM) slashes storage costs by up to 16.95×, proving that efficiency and performance can indeed dance together!
- Seamless Sync: Minimal-cost continual pre-training aligns the knowledge between the pruned and original models, ensuring no magic is lost in translation.
The Results
With LoRAM, we not only achieve dominant performance gains over both the original 70B model and smaller LoRA-trained models but also make massive model training accessible, running on a single 20G GPU!
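To make the prune, train small, recover, infer large loop concrete, here is a toy single-layer sketch. The dimensions, the magnitude-based pruning rule, and the zero-fill recovery are illustrative assumptions (and the continual pre-training alignment step is omitted); see the paper for the actual LoRAM procedure.

```python
# Toy single-layer LoRAM-style sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, rank, keep = 64, 64, 8, 32      # toy sizes; keep = width of the pruned model

full = nn.Linear(d_in, d_out, bias=False)    # stands in for one layer of the big model
full.weight.requires_grad_(False)

# 1) Structured pruning: keep the output neurons with the largest weight norms.
kept = full.weight.norm(dim=1).topk(keep).indices.sort().values
pruned = nn.Linear(d_in, keep, bias=False)
pruned.weight.data = full.weight.data[kept].clone()
pruned.weight.requires_grad_(False)

# 2) LoRA training on the pruned (memory-cheap) model: only A and B get gradients.
A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
B = nn.Parameter(torch.zeros(keep, rank))
opt = torch.optim.AdamW([A, B], lr=1e-3)
teacher = torch.randn(d_in, keep)            # toy target function to fine-tune towards
for _ in range(200):
    x = torch.randn(16, d_in)
    y = pruned(x) + x @ A.T @ B.T            # pruned layer plus low-rank update
    loss = ((y - x @ teacher) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# 3) Recovery: place the trained low-rank update back at the original neuron
#    positions (zero rows for pruned neurons) and merge it into the full layer.
B_full = torch.zeros(d_out, rank)
B_full[kept] = B.detach()
merged = nn.Linear(d_in, d_out, bias=False)
merged.weight.data = full.weight.data + B_full @ A.detach()

# Inference now runs on the original-size layer carrying the recovered update.
print(merged(torch.randn(2, d_in)).shape)    # torch.Size([2, 64])
```

The point of the sketch is that gradients only ever touch the small A and B matrices on the pruned model, while the full-size weights are only assembled once, at inference time.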
Curious to see the magic in action? Check out our paper and code:
- Paper: Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
- GitHub: LoRAM on GitHub
We can't wait for you to join us on this exhilarating journey where smart engineering meets a splash of neural magic!
Cheers,
The LoRAM Team
Dear AK and HF team,
We are excited to share our new paper estimating the hallucination rates of 11 multilingual large language models across 30 languages.
The paper comes with two open-source datasets that are ready for the community to use. The figure below shows hallucination rates across the 11 LLMs for all 30 languages.
Summary of our findings:
- Within an LLM family, smaller LLMs hallucinate more than their larger variants.
- A larger number of supported languages correlates significantly with a larger number of hallucinations.
- A smaller digital representation of a language does not necessarily mean higher hallucination rates.
Resources:
The paper releases two datasets, each covering 30 languages:
- Multilingual Hallucination Detection: https://huggingface.co./datasets/WueNLP/mHallucination_Detection
- Multilingual Hallucination Evaluation: https://huggingface.co./datasets/WueNLP/mHallucination_Evaluation
Paper, Dataset, and Code:
- arXiv paper: How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
- Hugging Face collection: https://huggingface.co./collections/WueNLP/mhallucinations-llm-67b5aedb0e7fed1190e148d8
- Github: https://github.com/WorldHellow/mHallucinations-LLM
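For anyone who wants to poke at the data right away, here is a minimal loading sketch with the datasets library (any split or configuration names are assumptions; please check the dataset cards for the exact layout):

```python
# Minimal sketch: load the two released resources from the Hugging Face Hub.
from datasets import load_dataset

detection = load_dataset("WueNLP/mHallucination_Detection")
evaluation = load_dataset("WueNLP/mHallucination_Evaluation")

print(detection)    # inspect available splits and columns
print(evaluation)
```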
We hope the community enjoys reading and building on our work.
Cheers
Dear AK and HF Team,
We are excited to share our work on Multimodal Inconsistency Reasoning (MMIR). The information for the paper we submitted is as follows:
Title: Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Paper Link: https://arxiv.org/pdf/2502.16033
Github: https://github.com/eric-ai-lab/MMIR
Dataset: https://huggingface.co./datasets/rippleripple/MMIR
Dear AK and HF Team,
I'm super excited to recommend SE Arena, a new interactive platform for benchmarking Software Engineering chatbots.
If you're working with AI in software development, or just passionate about improving how these models perform in real-world dev workflows, you have to check it out!
The best part? SE Arena has a transparent, open-source leaderboard, and you can actively contribute by casting your votes to shape the evaluations. Plus, with RepoChat, it pulls in real repository context (issues, commits, PRs) to ground evaluations in realistic workflows.
Want to get involved and help drive the future of AI in software engineering? Head over to https://huggingface.co./spaces/SE-Arena/Software-Engineering-Arena and cast your vote today!
Our paper is published in FORGE 2025: https://conf.researchr.org/details/forge-2025/forge-2025-papers/6/SE-Arena-An-Interactive-Platform-for-Evaluating-Foundation-Models-in-Software-Engine
Check out the details at https://arxiv.org/abs/2502.01860
We'd love your feedback and contributions!
Can we align LLMs with personal preferences? Collecting enough individual annotations and training a separate LLM for each persona is hard... The answer is
Yes! Drift achieves personalized alignment with only 50-100 examples.
- Drift Approximation: For efficient preference modeling, we first define various attributes and find the best composite of them to explain the given examples.
- Differential Prompting: There is no need to construct attribute-dedicated datasets! We show that differential prompting can evaluate each attribute in a zero-shot manner.
- Drift Decoding: We align the LLM with the composite of attributes in a training-free manner, so there is no expensive LLM training or per-user model storage.
We theoretically justify the objectives of the approximation and decoding stages, and no stage in the entire process requires gradient computation.
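As a very rough sketch of what training-free, attribute-composed decoding can look like (the base model, the two example attributes, their weights, and this particular logit-composition rule are all illustrative assumptions, not Drift's actual formulation):

```python
# Illustrative attribute-composed decoding sketch (not Drift's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"    # small stand-in model (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

user_prompt = "Recommend a weekend activity."
# Hypothetical attribute composite, e.g. fitted from 50-100 examples of one user.
attributes = {"The assistant is concise.": 0.6, "The assistant is playful.": 0.4}

def next_logits(prefix: str, generated_ids: torch.Tensor) -> torch.Tensor:
    """Next-token logits given a prompt prefix plus the tokens generated so far."""
    ids = tok(prefix, return_tensors="pt").input_ids
    ids = torch.cat([ids, generated_ids], dim=1)
    with torch.no_grad():
        return model(ids).logits[0, -1]

base_prefix = f"User: {user_prompt}\nAssistant:"
generated = torch.empty(1, 0, dtype=torch.long)
for _ in range(60):                            # plain greedy decoding loop
    base = next_logits(base_prefix, generated)
    combined = base.clone()
    for attribute, weight in attributes.items():
        attr_prefix = f"{attribute}\nUser: {user_prompt}\nAssistant:"
        # Differential signal: how conditioning on the attribute shifts the logits.
        combined += weight * (next_logits(attr_prefix, generated) - base)
    next_id = combined.argmax()
    if next_id.item() == tok.eos_token_id:
        break
    generated = torch.cat([generated, next_id.view(1, 1)], dim=1)

print(tok.decode(generated[0], skip_special_tokens=True))
```

The only point of the sketch is that personalization happens entirely at decoding time: the per-user signal lives in a handful of attribute weights rather than in any fine-tuned parameters.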
Check the details here! https://arxiv.org/abs/2502.14289
DeepSQL-R1-distill-8B: A Quantized DeepSeek AI Model for SQL Code Generation
- Outperforms Llama-3.2, Mistral-7B, and Claude-3 Sonnet in SQL generation tasks.
- Superior execution accuracy and faster inference speeds for complex SQL queries.
- Optimized for efficiency with quantization & distillation techniques.
Model Link: https://huggingface.co./imsanjoykb/deepSQL-R1-distill-8B
Code Link: https://github.com/imsanjoykb/deepSQL-R1-distill-8B
Paper : https://doi.org/10.6084/m9.figshare.28330301.v1
Inference: https://drive.google.com/file/d/145PP-oW50OMS1bYJaYuUphfufpsuOGWl/view?usp=sharing
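For anyone who wants to try the checkpoint, here is a minimal generation sketch with Transformers (the prompt format, toy schema, and generation settings below are assumptions; please follow the model card and the linked inference notebook for the intended usage):

```python
# Minimal sketch: generate SQL from a natural-language question with the released model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "imsanjoykb/deepSQL-R1-distill-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Toy schema and question purely for illustration.
schema = "CREATE TABLE orders (id INT, customer TEXT, total REAL, created_at DATE);"
question = "Total revenue per customer in 2024, highest first."
prompt = f"-- Schema:\n{schema}\n-- Question: {question}\n-- SQL:\n"

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```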