Manuel Romero (mrm8488)

AI & ML interests

#AI Research and Democratization. NLP/NLG 🤗

Organizations

Notebooks-explorers, Narrativa, Spaces-explorers, BERTIN Project, Flax Community, NLP en ES, Artelligence, Hackathon Somos NLP 2023: Los LLMs hablan Español, BigScience Catalogue Data, Speech Recognition Community Event Version 2, I Hackathon Somos NLP: PLN en Español, BigScience Data, BARTolo, Biomedical TeMU, SomosNLP, How to teach Hugging Face?, Hugging Face Fellows, Gradio-Blocks-Party, Webhooks Explorers (BETA), Open-Source AI Meetup, EuroPython 2022, ICML 2022, Manuel Romero, BigCode, Platzi Community, Stable Diffusion concepts library, Curso IA Aplicada UNIA 2023, CliBrAIn, ClibrAIn Portfolio, huggingPartyParis, ZeroGPU Explorers, Unofficial Mistral Community, MLX Community, Social Post Explorers, Top Contributors: Model Downloads, MAISA AI, Hugging Face Discord Community

Posts

Post
🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper (which builds on Llama-3-8B-Instruct) and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showed that if you use the instruction-tuned version (Llama-3-8B-Instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on that dataset, the result can improve on even the instruction-tuned version.

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish: I've successfully adapted the technique, demonstrating its flexibility and multilingual reach.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by Sebastian Raschka's) that automatically generates similar datasets using Ollama models (initially phi and llama3) and uploads them to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)
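
For readers who want the gist of the trick before opening the script: the MAGPIE idea is to send the instruct model only the part of its chat template that precedes the user message, so the model "completes" the user turn and thereby writes an instruction, which you then feed back to get an answer. Below is a minimal sketch of that loop against a local Ollama server (assumed to be running on the default port with `llama3` pulled); the prompts, sampling options, and filtering in the actual gist may differ.

```python
# Minimal MAGPIE-style generation loop via Ollama's /api/generate endpoint.
# Assumptions: a local Ollama server on the default port, "llama3" pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

# Llama-3 chat template up to where the *user* message would start. Sent with
# raw=True so Ollama does not wrap it in its own template; the instruct model
# then "completes" the user turn, i.e. it writes a synthetic instruction.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

def generate(prompt: str, model: str = "llama3") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "raw": True,        # bypass Ollama's prompt templating
        "stream": False,
        "options": {"num_predict": 256},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

pairs = []
for _ in range(5):  # raise this to build a real dataset
    instruction = generate(PRE_QUERY).strip()
    # Feed the synthetic instruction back through the full template
    # to obtain the model's answer to it.
    answer = generate(
        PRE_QUERY + instruction
        + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    ).strip()
    pairs.append({"instruction": instruction, "output": answer})

print(json.dumps(pairs[0], ensure_ascii=False, indent=2))
```

One plausible way to get Spanish instructions out of this loop is to prepend a system turn written in Spanish before the user header; the linked script is the reference for how the Spanish adaptation is actually done.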


🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co./datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co./datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co./datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)


Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.
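
As a hedged illustration of what "additional quality filters" could look like with the `datasets` library (the column names, thresholds, and target repo below are assumptions for the sketch, not the documented schema of these datasets):

```python
# Hypothetical extra quality pass before fine-tuning. The column names
# ("instruction"/"output"), thresholds, and target repo id are illustrative
# assumptions, not the datasets' documented schema.
from datasets import load_dataset

ds = load_dataset(
    "mrm8488/dataset_llama3_5000_samples_es_4231_filtered", split="train"
)

def looks_usable(example):
    instruction, output = example["instruction"], example["output"]
    # Drop degenerate pairs: too short, or the answer merely echoing the prompt.
    return (
        len(instruction) > 20
        and len(output) > 50
        and instruction.lower() not in output.lower()
    )

clean = ds.filter(looks_usable)
print(f"kept {len(clean)}/{len(ds)} examples")
# clean.push_to_hub("your-username/your_filtered_dataset")  # hypothetical repo
```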

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/
Post
Working on a proof-of-concept GPT-2 (small) that uses KANs (Kolmogorov-Arnold Networks) instead of MLPs.
The checkpoint and training code will be on the Hub soon.
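
For readers curious what "KANs instead of MLPs" could mean in practice, here is a minimal sketch of swapping a GPT-2 block's feed-forward MLP for a KAN-style layer. This is not the author's implementation: it parameterizes the learnable per-edge functions with a fixed radial-basis grid instead of the B-splines of the original KAN paper, and every name in it is illustrative.

```python
# Hedged sketch: a KAN-style replacement for GPT-2's feed-forward block.
# Simplification: per-edge functions are learned combinations of fixed RBF
# bumps rather than the B-splines used in the original KAN paper.
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """y_j = sum_i phi_ij(x_i), each phi_ij a learned combo of RBF bumps."""
    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*grid_range, num_basis))
        self.inv_width = num_basis / (grid_range[1] - grid_range[0])
        # One coefficient per (input, basis function, output) edge.
        self.coef = nn.Parameter(torch.randn(in_dim, num_basis, out_dim) * 0.1)
        self.base = nn.Linear(in_dim, out_dim)  # residual linear path

    def forward(self, x):  # x: (..., in_dim)
        # RBF features of each scalar input: (..., in_dim, num_basis)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) * self.inv_width) ** 2)
        # Sum the learned edge functions over inputs and basis terms.
        return torch.einsum("...ib,ibo->...o", phi, self.coef) + self.base(x)

class KANBlockMLP(nn.Module):
    """Drop-in for GPT-2's 4x-expansion MLP inside a transformer block."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            KANLayer(d_model, 4 * d_model),
            KANLayer(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

x = torch.randn(2, 16, 768)       # (batch, seq, d_model) as in GPT-2 small
print(KANBlockMLP(768)(x).shape)  # torch.Size([2, 16, 768])
```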