We're excited to introduce MemoryCode, a novel synthetic dataset designed to rigorously evaluate LLMs' ability to track and execute coding instructions across multiple sessions. MemoryCode simulates realistic workplace scenarios where a mentee (the LLM) receives coding instructions from a mentor amidst a stream of both relevant and irrelevant information.
💡 But what makes MemoryCode unique?! The combination of the following (a toy sketch follows the list):
✅ Multi-Session Dialogue Histories: MemoryCode consists of chronological sequences of dialogues between a mentor and a mentee, mirroring real-world interactions between coworkers.
✅ Interspersed Irrelevant Information: Critical instructions are deliberately interspersed with unrelated content, replicating the information overload common in office environments.
✅ Instruction Updates: Coding rules and conventions can be updated multiple times throughout the dialogue history, requiring LLMs to track and apply the most recent information.
✅ Prospective Memory: Unlike previous datasets that cue information retrieval, MemoryCode requires LLMs to spontaneously recall and apply relevant instructions without explicit prompts.
✅ Practical Task Execution: LLMs are evaluated on their ability to use the retrieved information to perform practical coding tasks, bridging the gap between information recall and real-world application.
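To make the setup concrete, here is a minimal toy sketch of what a MemoryCode-style example could look like. The field names and dialogue text are my own illustration, not the actual dataset schema:

```python
# Hypothetical MemoryCode-style example: field names and content are illustrative,
# not the dataset's real schema.
example = {
    "sessions": [
        {
            "dialogue": "Mentor: Welcome aboard! Grab a coffee, standup is at 10. "
                        "By the way, always prefix function names with 'x_'.",
            "instruction": {"id": "naming", "rule": "prefix function names with 'x_'"},
        },
        {
            "dialogue": "Mentor: The offsite moved to Friday. Nothing else today.",
            "instruction": None,  # filler session containing only irrelevant information
        },
        {
            "dialogue": "Mentor: Change of plans: prefix function names with 'fn_' instead.",
            "instruction": {"id": "naming", "rule": "prefix function names with 'fn_'"},  # update
        },
    ],
    # Final task: no explicit cue pointing back to the instructions (prospective memory).
    "task": "Write a function that returns the n-th Fibonacci number.",
    # Evaluation checks that the generated code follows the *latest* version of each rule.
    "expected_rules": ["function names start with 'fn_'"],
}
```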
π Our Findings
1️⃣ While even small models can handle isolated coding instructions, the performance of top-tier models like GPT-4o dramatically deteriorates when instructions are spread across multiple sessions.
2️⃣ This performance drop isn't simply due to the length of the context. Our analysis indicates that LLMs struggle to reason compositionally over sequences of instructions and updates. They have difficulty keeping track of which instructions are current and how to apply them.
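The bookkeeping the model is implicitly asked to do is easy to write down explicitly, which makes the failure mode all the more striking. A rough sketch (my own illustration, reusing the hypothetical example above, not code from the paper):

```python
def current_rules(sessions):
    """Fold instruction updates in chronological order so that, for each rule id,
    only the most recent version survives. This is the tracking the model is
    implicitly expected to perform from the dialogue history alone."""
    rules = {}
    for session in sessions:
        instruction = session.get("instruction")
        if instruction is not None:
            rules[instruction["id"]] = instruction["rule"]  # later sessions overwrite earlier ones
    return rules

# With the hypothetical example above, only the updated naming rule remains:
# current_rules(example["sessions"]) -> {"naming": "prefix function names with 'fn_'"}
```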
Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait, ALLaM is finally here, and it is IMPRESSIVE given its size!
Evaluating Long Context #2: SCROLLS and ZeroSCROLLS
In this series of posts tracing the history of long context evaluation, we started with Long Range Arena (LRA). Introduced in 2020, LRA is one of the earliest benchmarks designed to tackle the challenge of long context evaluation. However, it was not introduced to evaluate LLMs, but rather the transformer architecture in general.
The SCROLLS benchmark, introduced in 2022, addresses this gap in NLP/LLM research. SCROLLS challenges models with tasks that require reasoning over extended sequences (according to 2022 standards). So, what does it offer?
1️⃣ Long Text Focus: Unlike LRA, SCROLLS focuses mainly on text and contains inputs with thousands of words, testing models' ability to synthesize information across lengthy documents.
2️⃣ Diverse Tasks: Includes summarization, question answering, and natural language inference across domains like literature, science, and business.
3️⃣ Unified Format: All datasets are available in a text-to-text format, facilitating easy evaluation and comparison of models.
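If you want to poke at the data yourself, SCROLLS is hosted on the Hugging Face Hub. A minimal sketch, assuming the `tau/scrolls` dataset id and the `gov_report` config (check the Hub for the exact names; older script-based datasets may also require `trust_remote_code=True` depending on your `datasets` version):

```python
from datasets import load_dataset

# Assumes the benchmark is hosted as "tau/scrolls" with a "gov_report" config;
# verify the dataset id, config names, and field names on the Hugging Face Hub.
gov_report = load_dataset("tau/scrolls", "gov_report", split="validation")

sample = gov_report[0]
print(sample["input"][:500])                     # long source document (text-to-text input)
print(sample["output"])                          # reference summary (text-to-text target)
print(len(sample["input"].split()), "words in the input")
```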
Building on SCROLLS, ZeroSCROLLS takes long text evaluation to the next level by focusing on zero-shot learning. Other features include:
1️⃣ New Tasks: Introduces tasks like sentiment aggregation and sorting book chapter summaries.
2️⃣ Leaderboard: A live leaderboard encourages continuous improvement and competition among researchers.
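Since ZeroSCROLLS is about zero-shot evaluation, there is no task-specific training at all: everything collapses into prompt construction. A rough sketch of the kind of prompt you would send to a model (the wording is mine, not the official ZeroSCROLLS template):

```python
def build_zero_shot_prompt(document: str, question: str) -> str:
    # Illustrative zero-shot prompt: task description + long input, no in-context examples.
    return (
        "You are given a long document followed by a question.\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}\n"
        "Answer concisely, based only on the document."
    )

# The model's raw completion is then scored against the reference answer.
```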
💡 What are some other landmark benchmarks in the history of long context evaluation? Feel free to share your thoughts and suggestions in the comments.
The 3C3H AraGen Leaderboard today welcomes deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5) to the ranking of the best LLMs in Arabic!
Observations:
- DeepSeek-V3 ranked 3rd and is the only open model among the top 5!
- A 14B open model (Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how far we have come in advancing and supporting the Arabic presence within the LLM ecosystem!
- Contrary to what is observed on likelihood-accuracy leaderboards (like OALL/Open-Arabic-LLM-Leaderboard), further fine-tuned models like maldv/Qwentile2.5-32B-Instruct actually perform worse than the original model Qwen/Qwen2.5-32B-Instruct. It's worth noting that the decrease is statistically insignificant, which implies that, at best, out-of-domain fine-tuning does not really hurt the capabilities the model acquired during pretraining. Previous work has addressed this (fine-tuning vs. pretraining), but more investigation is required (any PhDs here? This could be your question ...)
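For readers wondering what "statistically insignificant" means in practice here: given per-example scores for two models on the same evaluation set, a paired bootstrap is one standard way to check it. A rough sketch (my own illustration, not the leaderboard's actual procedure):

```python
import numpy as np

def paired_bootstrap_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap over the evaluation set for the mean score difference
    between two models scored on the same examples. Returns the observed
    difference and the fraction of resamples where its sign flips."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    observed = scores_a.mean() - scores_b.mean()
    n = len(scores_a)
    flips = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        diff = scores_a[idx].mean() - scores_b[idx].mean()
        if np.sign(diff) != np.sign(observed):
            flips += 1
    return observed, flips / n_resamples  # high flip rate => difference not significant
```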
~75% on the challenging GPQA with only 40M parameters 🔥🥳
GREAT ACHIEVEMENT! Or is it?
This new work, "Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation", takes the mystery out of many models whose results I have personally suspected, especially on leaderboards other than the English one, like the Open Arabic LLM Leaderboard (OALL/Open-Arabic-LLM-Leaderboard).
The authors of this work first trained a model on the GPQA data, which, unsurprisingly, led to the model achieving 100% performance.
Afterward, they trained what they referred to as a 'legitimate' model on legitimate data (MedMCQA). However, they introduced a distillation loss from the earlier, 'cheated' model.
What they discovered was fascinating: the knowledge of GPQA leaked through this distillation loss, even though the legitimate model was never explicitly trained on GPQA during this stage.
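To see how the leak happens mechanically, here is a minimal sketch of a standard knowledge-distillation objective, assuming you have the "cheated" teacher's logits and a "legitimate" training batch. This is a generic KD setup for illustration, not the paper's exact code or hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: cross-entropy on the legitimate labels (e.g. MedMCQA)
    plus a KL term pulling the student toward the teacher's softened distribution.
    If the teacher memorized GPQA, that knowledge is transferred through the KL term,
    even though GPQA never appears in the student's training batch."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```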
This raises important questions about the careful use of distillation in model training, especially when the training data is opaque. As they demonstrated, it's apparently possible to (intentionally or unintentionally) leak test data through this method.
Unpopular opinion: doing open source takes courage!
Not everyone is brave enough to release what they have done (the way they've done it) into the wild to be judged! It really requires a high level of "knowing wth you are doing"! It's kind of a superpower!