Ram Kadiyala PRO

1024m

AI & ML interests

NLP / LLM post-training

Recent Activity

updated a dataset about 5 hours ago
1024m/mMGTD-Corpus
updated a Space 6 days ago
Hindi-Gemma/README
updated a dataset 8 days ago
1024m/old-newpapers

Organizations

AI FILMS
Samsung Electronics
GEM benchmark
MusicAI
BigScience Biomedical Datasets
LangChainDatasets
fast.ai community
OpenVINO Toolkit
Gradio-Themes-Party
scikit-learn
Open-Source AI Meetup
lora concepts library
Platzi Community
Kornia AI
Tune a video concepts library
Stable Diffusion Dreambooth Concepts Library
Musika
OpenSky
Tensor Diffusion
Media Party 2023
MLX Vision
LocalLLaMA
MLX Community
C4AI Community
Hugging Face 1Bit LLMs
Stable Diffusion Community (Unofficial, Non-profit)
Hugging Face for Legal
llmc
1-800-Shared-Tasks
Hugging Face Party @ PyTorch Conference
Dataset Tools
Hindi Gemma Team
AI4Humour
Synthetic Datasets For Low-Resource Langauges
open/ acc
Data Is Better Together Contributor
AI Starter Pack

1024m's activity

reacted to thomwolf's post with 🤗 14 days ago
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi