
Danielus (danielus)

AI & ML interests

None yet

Recent Activity

liked a Space 6 days ago
reach-vb/2024-ai-timeline
liked a Space 10 days ago
fffiloni/AniDoc
liked a Space 10 days ago
lllyasviel/lumina_brush_uniform_lit

Organizations

None yet

danielus's activity

reacted to hexgrad's post with 🤗 17 days ago
Tonight, Adam & Michael join the 82M-parameter, Apache-licensed TTS model in hexgrad/Kokoro-82M
reacted to YerbaPage's post with 👀 26 days ago
Curated list of **Repository-level Code Generation** papers & benchmarks! 🔥

Stay ahead with the latest in:
โœ… Repo-level Issue Resolution
โœ… Repo-level Code Completion
โœ… Datasets & Benchmarks

👉 Check it out: https://github.com/YerbaPage/Awesome-Repo-Level-Code-Generation 🔥
reacted to m-ric's post with 🔥 29 days ago
๐—ฃ๐—ผ๐˜๐—ฒ๐—ป๐˜๐—ถ๐—ฎ๐—น ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—ฑ๐—ถ๐—ด๐—บ ๐˜€๐—ต๐—ถ๐—ณ๐˜ ๐—ถ๐—ป ๐—Ÿ๐—Ÿ๐— ๐˜€: ๐—ป๐—ฒ๐˜„ ๐—ฝ๐—ฎ๐—ฝ๐—ฒ๐—ฟ ๐—ฏ๐˜† ๐— ๐—ฒ๐˜๐—ฎ ๐—ฐ๐—น๐—ฎ๐—ถ๐—บ๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐˜„๐—ฒ ๐—ฐ๐—ฎ๐—ป ๐—ด๐—ฒ๐˜ ๐—ฟ๐—ถ๐—ฑ ๐—ผ๐—ณ ๐˜๐—ผ๐—ธ๐—ฒ๐—ป๐—ถ๐˜‡๐—ฒ๐—ฟ๐˜€! ๐Ÿฅณ

Current LLMs process text by first splitting it into tokens. They use a module named a "tokenizer" that -spl-it-s- th-e- te-xt- in-to- arbitrary tokens based on a fixed dictionary.
On the Hub you can find this dictionary in a model's files under tokenizer.json.

โžก๏ธ This process is called BPE tokenization. It is suboptimal, everyone says it. It breaks text into predefined chunks that often fail to capture the nuance of language. But it has been a necessary evil in language models since their inception.

💥 In Byte Latent Transformer (BLT), Meta researchers propose an elegant solution by eliminating tokenization entirely, working directly with raw bytes while maintaining efficiency through dynamic "patches."

This had been tried before with other byte-level tokenizations, but it's the first time an architecture of this type scales as well as BPE tokenization. And it could mean a real paradigm shift! 👏👏

๐Ÿ—๏ธ ๐—”๐—ฟ๐—ฐ๐—ต๐—ถ๐˜๐—ฒ๐—ฐ๐˜๐˜‚๐—ฟ๐—ฒ:
Instead of a lightweight tokenizer, BLT has a lightweight encoder that processes raw bytes into patches. The patches are then processed by the main heavy-duty transformer as usual (but over patches of bytes instead of tokens), before being converted back to bytes.

🧩 **Dynamic Patching:**
Instead of fixed tokens, BLT groups bytes based on their predictability (measured by entropy), using more compute for complex sequences and less for simple ones. This allows efficient processing while maintaining byte-level understanding.
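The entropy-driven grouping can be sketched as follows. This is a toy illustration of the idea, not the paper's model: the threshold value and the toy probability distributions are invented, and in BLT the per-byte distributions come from a small byte-level language model.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def patch_boundaries(per_byte_probs, threshold=2.0):
    """Start a new patch wherever the predicted next-byte entropy exceeds
    the threshold, i.e. wherever the text becomes hard to predict.
    per_byte_probs[i] is the model's distribution for byte i.
    The threshold here is an invented illustration, not the paper's value."""
    boundaries = [0]
    for i, probs in enumerate(per_byte_probs[1:], start=1):
        if entropy(probs) > threshold:
            boundaries.append(i)
    return boundaries

# Toy distributions: a mostly-certain byte (low entropy) vs a uniform one (high entropy).
certain = [0.97] + [0.03 / 255] * 255
uniform = [1 / 256] * 256
probs = [certain, certain, uniform, certain, certain, uniform, certain]
print(patch_boundaries(probs))  # → [0, 2, 5]: new patches start at hard-to-predict bytes
```

Easy, predictable stretches of bytes get merged into long patches (cheap to process), while surprising bytes open fresh patches that receive more compute.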

I hope this breakthrough is confirmed so we can get rid of all the tokenizer machinery; it will make model handling much easier!

Read their paper here 👉 https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf
reacted to julien-c's post with 🔥 about 1 month ago
After some heated discussion 🔥, we clarify our intent re: storage limits on the Hub

TL;DR:
- public storage is free and (barring blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co./docs/hub/storage-limits

We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
reacted to clem's post with 🚀 about 1 month ago
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):

- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race).
- There will be big breakthroughs in AI for biology and chemistry.
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face.

How my predictions for 2024 turned out:

- A hyped AI company will go bankrupt or get acquired for a ridiculously low price
✅ (Inflection, Adept AI, ...)

- Open-source LLMs will reach the level of the best closed-source LLMs
✅ with QwQ and dozens of others

- Big breakthroughs in AI for video, time-series, biology and chemistry
✅ for video 🔴 for time-series, biology and chemistry

- We will talk much more about the cost (monetary and environmental) of AI
✅ monetary 🔴 environmental (😢)

- A popular media will be mostly AI-generated
✅ with NotebookLM by Google

- 10 million AI builders on Hugging Face, leading to no increase in unemployment
🔜 currently 7M AI builders on Hugging Face
New activity in infly/OpenCoder-8B-Instruct 2 months ago

FIM task

#2 opened 2 months ago by danielus
reacted to TuringsSolutions's post with 😔 2 months ago
Are you familiar with the difference between discrete learning and predictive learning? This distinction is exactly why LLMs are not designed to perform and execute function calls: they are not the right shape for it. LLMs are prediction machines, while function calling requires discrete learning machines. Fortunately, you can easily couple an LLM with a discrete learning algorithm; you simply need to know the math to do it. Want to dive deeper into this subject? Check out this video.

https://youtu.be/wBRem2p8iPM