Code Llama

AI & ML interests

None defined yet.

codellama's activity

Narsil
posted an update 13 days ago
Performance leap: TGI v3 is out. Processes 3x more tokens, 13x faster than vLLM on long prompts. Zero config!



3x more tokens.

By reducing our memory footprint, we're able to ingest many more tokens, and more dynamically, than before. A single L4 (24GB) can handle 30k tokens on Llama 3.1-8B, while vLLM barely manages 10k. A lot of work went into reducing the runtime's footprint, and its effects are most visible in smaller, constrained environments.
13x faster

On long prompts (200k+ tokens), conversation replies take 27.5s in vLLM, while they take only 2s in TGI. How so? We keep the initial conversation around, so when a new reply comes in, we can answer almost instantly. The overhead of the lookup is ~5us. Thanks @Daniël de Kok for the beast data structure.
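Conceptually, the reuse works like the sketch below: cache the KV state of earlier turns keyed by their token prefix, and only prefill the new suffix. This is an illustrative Python sketch of the idea, not TGI's actual implementation (which uses a heavily optimized lookup structure in Rust); the cache class, names, and token IDs are made up.

```python
# Illustrative sketch of prefix caching: keep the KV state of previous
# conversations keyed by their token prefix, so a follow-up request only
# has to prefill the new tokens. NOT TGI's actual implementation.

class PrefixCache:
    def __init__(self):
        # Maps a tuple of token IDs (a prompt prefix) to its cached KV state.
        self._cache = {}

    def store(self, token_ids, kv_state):
        self._cache[tuple(token_ids)] = kv_state

    def longest_prefix(self, token_ids):
        """Return (cached_kv, matched_length) for the longest cached prefix."""
        for end in range(len(token_ids), 0, -1):
            key = tuple(token_ids[:end])
            if key in self._cache:
                return self._cache[key], end
        return None, 0


cache = PrefixCache()
cache.store([1, 15, 42, 7], kv_state="kv-of-first-turn")  # after the first reply

# A follow-up request repeats the old conversation plus a few new tokens:
new_request = [1, 15, 42, 7, 99, 100]
kv, matched = cache.longest_prefix(new_request)
print(f"reusing {matched} cached tokens, only {len(new_request) - matched} need prefill")
```

The linear scan here is just for readability; the point is that a follow-up turn only pays for its new tokens, which is where the 13x comes from.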
Zero config

That's it. Remove all the flags you are using and you're likely to get the best performance. By evaluating the hardware and model, TGI automatically selects values that give the best performance. In production, we don't have any flags in our deployments anymore. We kept all the existing flags around; they may come in handy in niche scenarios.

Read more: https://huggingface.co./docs/text-generation-inference/conceptual/chunking
loubnabnl
posted an update about 1 month ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
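If you want to poke at the SFT dataset mentioned in the list above, loading it with the `datasets` library should look roughly like this. A minimal sketch; the config name and column layout are assumptions, so check the dataset card on the Hub.

```python
from datasets import load_dataset

# Load the SmolTalk SFT dataset mentioned above. The "all" config and the
# "messages" column are assumptions; see the dataset card for the exact names.
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

print(ds)
print(ds[0]["messages"])  # chat-style list of {"role": ..., "content": ...} turns
```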
andrewrreed
posted an update about 1 month ago
Trace LLM calls with Arize AI's Phoenix observability dashboards on Hugging Face Spaces! 🚀

✨ I just added a new recipe to the Open-Source AI Cookbook that shows you how to:
1️⃣ Deploy Phoenix on HF Spaces with persistent storage in a few clicks
2️⃣ Configure LLM tracing with the Serverless Inference API
3️⃣ Observe multi-agent application runs with the CrewAI integration

๐—ข๐—ฏ๐˜€๐—ฒ๐—ฟ๐˜ƒ๐—ฎ๐—ฏ๐—ถ๐—น๐—ถ๐˜๐˜† ๐—ถ๐˜€ ๐—ฐ๐—ฟ๐˜‚๐—ฐ๐—ถ๐—ฎ๐—น for building robust LLM apps.

Phoenix makes it easy to visualize trace data, evaluate performance, and track down issues. Give it a try!

🔗 Cookbook recipe: https://huggingface.co./learn/cookbook/en/phoenix_observability_on_hf_spaces
🔗 Phoenix docs: https://docs.arize.com/phoenix
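As a rough idea of what the tracing step boils down to, pointing an app at a Phoenix backend is standard OpenTelemetry wiring. This is a minimal sketch; the Space URL, span attributes, and auth handling are assumptions, and the cookbook recipe above has the exact setup, including the Serverless Inference API and CrewAI pieces.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Point the OTLP exporter at the Phoenix collector endpoint
# (hypothetical Space URL; use your own Space and auth headers).
exporter = OTLPSpanExporter(endpoint="https://your-phoenix-space.hf.space/v1/traces")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-llm-app")

# Every LLM call wrapped in a span shows up in the Phoenix dashboards.
with tracer.start_as_current_span("llm-call") as span:
    span.set_attribute("llm.prompt", "What is observability?")
    # ... call your model here and record the completion ...
    span.set_attribute("llm.completion", "...")
```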
ArthurZ
posted an update about 1 month ago
Native tensor parallelism has landed in transformers!!! https://github.com/huggingface/transformers/pull/34184 Thanks a lot to the torch team for their support!

Contributions are welcome to support more models! 🔥
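As far as I understand the PR, usage looks roughly like the sketch below: launch under `torchrun` and let transformers shard the model. The `tp_plan="auto"` argument and the model choice are assumptions; check the transformers docs for the exact, current API.

```python
# Minimal sketch of native tensor parallelism in transformers.
# Assumed usage: run with `torchrun --nproc-per-node 4 tp_sketch.py`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any supported model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # assumed flag from the PR: shard weights across the torchrun world
)

inputs = tokenizer("Tensor parallelism lets you", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```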
loubnabnl
posted an update 7 months ago
๐Ÿท FineWeb technical report is out and so is ๐Ÿ“š FineWeb-Edu, a 1.3 trillion tokens dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarksย such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
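For reference, streaming a slice of FineWeb-Edu (rather than downloading the whole thing) looks roughly like this. A minimal sketch with the `datasets` library; the `sample-10BT` subset and the `score`/`text` column names are assumptions based on the dataset card.

```python
from datasets import load_dataset

# Stream FineWeb-Edu instead of downloading all 1.3T tokens locally.
fw_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",   # assumed subset name; see the dataset card
    split="train",
    streaming=True,
)

for doc in fw_edu.take(3):
    # Each record carries the web text plus the educational quality score
    # assigned by the classifier (field names assumed from the dataset card).
    print(doc["score"], doc["text"][:200])
```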

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!
Narsil
posted an update 7 months ago
andrewrreed
posted an update 8 months ago
🔬 Open LLM Progress Tracker 🔬

Inspired by the awesome work from @mlabonne, I created a Space to monitor the narrowing gap between open and proprietary LLMs as scored by the LMSYS Chatbot Arena ELO ratings 🤗

The goal is to have a continuously updated place to easily visualize these rapidly evolving industry trends 🚀

🔗 Open LLM Progress Tracker: andrewrreed/closed-vs-open-arena-elo
🔗 Source of inspiration: https://www.linkedin.com/posts/maxime-labonne_arena-elo-graph-updated-with-new-models-activity-7187062633735368705-u2jB/
Narsil
posted an update 8 months ago
pcuenq
posted an update 8 months ago
OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes varying between 270M and 3B parameters. Their quality, according to benchmarks, is similar to OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.

I converted these models to Core ML, for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models were uploaded to this community on the Hub for anyone who wants to integrate them into their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.
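For context, a stripped-down version of that kind of conversion, with the parameters above, looks roughly like this. It is an illustrative sketch rather than the actual gist: the wrapper module, vocab size, and deployment target are assumptions, and custom-code models don't always trace this cleanly.

```python
# Rough sketch: trace a small OpenELM checkpoint in PyTorch, then convert it
# to Core ML with float32 precision and a fixed sequence length of 128.
import numpy as np
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM


class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so tracing sees a single tensor output (assumption)."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        return self.model(input_ids).logits


hf_model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M", trust_remote_code=True
)
wrapper = LogitsOnly(hf_model).eval()

seq_len = 128  # fixed sequence length, as above
example_ids = torch.randint(0, 32000, (1, seq_len))  # dummy token IDs (vocab size assumed)

traced = torch.jit.trace(wrapper, example_ids)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, seq_len), dtype=np.int32)],
    compute_precision=ct.precision.FLOAT32,  # float16 still shows precision issues
    minimum_deployment_target=ct.target.macOS14,  # assumed target
)
mlmodel.save("OpenELM-270M.mlpackage")
```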

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M model on my M1 Max, and about 6.5 tok/s with the largest 3B model. These speeds could be improved by converting to float16. However, there's some precision loss somewhere and generation doesn't work in float16 mode yet. I'm looking into this and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95

I'm also looking at optimizing inference using an experimental KV cache in swift-transformers. It's a bit tricky because the layers have a varying number of attention heads, but I'm curious to see how much this feature can speed up generation in this model family :)

Regarding the instruct fine-tuned models, I don't know the chat template that was used. The models use the Llama 2 tokenizer, but neither the Llama 2 chat template nor the default Alignment Handbook one used for training is recognized. Any ideas on this are welcome!
andrewrreed
posted an update 8 months ago
IMO, the "grounded generation" feature from Cohere's Command R+ has flown under the radar...

For RAG use cases, responses directly include inline citations, making source attribution an inherent part of generation rather than an afterthought 😎

Who's working on an open dataset with this for the HF community to fine-tune with??

🔗 Command R+ docs: https://docs.cohere.com/docs/retrieval-augmented-generation-rag

🔗 Model on the 🤗 Hub: CohereForAI/c4ai-command-r-plus
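For anyone who hasn't tried it, a grounded call has roughly this shape. This is a sketch with the Cohere Python SDK as I remember it; parameter and field names may differ slightly, so defer to the docs linked above.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # hypothetical key

# Pass source documents alongside the message; the reply comes back with
# inline citations pointing into these snippets (per the Command R+ docs).
response = co.chat(
    model="command-r-plus",
    message="When was the transformer architecture introduced?",
    documents=[
        {"title": "Attention Is All You Need", "snippet": "The paper was published in 2017."},
        {"title": "Wikipedia: Transformer", "snippet": "Transformers were introduced by Vaswani et al. in 2017."},
    ],
)

print(response.text)
for citation in response.citations or []:
    # Each citation spans a slice of the answer and references document ids
    # (field names assumed from memory; check the SDK docs).
    print(citation.start, citation.end, citation.document_ids)
```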
philschmid
posted an update 9 months ago
New state-of-the-art open LLM! 🚀 Databricks just released DBRX, a 132B MoE trained on 12T tokens. It claims to surpass OpenAI's GPT-3.5 and to be competitive with Google Gemini 1.0 Pro. 🤯

TL;DR
🧮 132B MoE with 16 experts, 4 active during generation
🪟 32,000-token context window
📈 Outperforms open LLMs on common benchmarks, including MMLU
🚀 Up to 2x faster inference than Llama 2 70B
💻 Trained on 12T tokens
🔡 Uses the GPT-4 tokenizer
📜 Custom license, commercially usable

Collection: databricks/dbrx-6601c0852a0cdd3c59f71962
Demo: databricks/dbrx-instruct
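Trying it with transformers should follow the usual pattern. A sketch only: at release the repo needed `trust_remote_code=True`, and a 132B MoE wants several large GPUs or quantization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dbrx-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # spread the 132B MoE across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Give me one fun fact about mixture-of-experts models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```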

Kudos to the team at Databricks and MosaicML for this strong release in the open community! 🤗
loubnabnl
posted an update 9 months ago
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope this will provide insights into generating synthetic data at scale for pre-training.
https://huggingface.co./blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates (see the sketch below).
📚 You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: scalable generation with tools like llm-swarm, plus fast model training and evaluation.

Have a good read!
ArthurZ
posted an update 10 months ago