Six months after joining Hugging Face, the Xet team is kicking off the first migrations from LFS to our storage for a number of repositories on the Hub.
More on the nitty gritty details behind the migration soon, but here are the big takeaways:
🤗 We've successfully completed the first migrations from LFS -> Xet to test the infrastructure and prepare for a wider release
✅ No action needed on your part - you can work with a Xet-backed repo like any other repo on the Hub (for now - major improvements are on their way!)
👀 Keep an eye out for the Xet logo to see if a repo you know is on our infra! See the screenshots below to spot the difference 👇
🚀 Want Early Access? If you're curious and want to test out the bleeding edge that will power the development experience on the Hub, we'd love to partner with you. Let me know!
Toward the end of last year, the Xet team provided an inside look into the foundations of how we plan to enable rapid experimentation and iteration for the AI builders on the Hub: https://huggingface.co./blog/from-files-to-chunks
But it turns out chunks aren't all you need!
Our goal is to bring: 🚀 Faster uploads ⬇️ Speedy downloads 💪 All without sacrificing your workflow
To do that, we need the infrastructure and system design to back it up. As we prepare to roll out the first Xet-backed repositories on the Hub, we wrote up a post explaining the nitty gritty details of the decisions that bring this to life: https://huggingface.co./blog/from-chunks-to-blocks
Complete with an interactive visualization that shows the power of deduplication in action - taking a 191GB repo down to ~97GB and shaving a few hours off upload times.
The darker each block in the heatmap, the more we dedupe, the less we have to transfer. Clicking on a file's blocks shows all other files that share blocks.
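In spirit, the "which files share blocks" lookup behind that click can be sketched in a few lines. This is purely illustrative - the function name and block IDs are hypothetical, not the visualization's actual code:

```python
def files_sharing_blocks(target: str, file_blocks: dict[str, set[str]]) -> dict[str, int]:
    """For a target file, count how many blocks each other file shares with it -
    the relationship the heatmap's click-through surfaces."""
    target_blocks = file_blocks[target]
    return {
        name: len(target_blocks & blocks)
        for name, blocks in file_blocks.items()
        if name != target and target_blocks & blocks
    }

# Hypothetical block hashes per file in a repo
repo = {
    "model-00001.safetensors": {"b1", "b2", "b3"},
    "model-00002.safetensors": {"b3", "b4"},
    "README.md": {"b9"},
}
print(files_sharing_blocks("model-00001.safetensors", repo))
# {'model-00002.safetensors': 1}
```

The more blocks two files share, the less new data a transfer has to move.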
I've been doing a lot of benchmarking and visualization work, which means I'm always searching for repos that are interesting in terms of file types, size, branches, and overall structure.
To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:
- Treemap of the repository, color coded by file/directory size
- Repo branches and their sizes
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)
And because I'm interested in how this will fit in our work to leverage content-defined chunking for versioning repos on the Hub - https://huggingface.co./blog/from-files-to-chunks - everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.
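Assuming fixed 64KB chunks for that chunk-count figure (the versioning approach itself uses variable-sized, content-defined chunks), the arithmetic is just a ceiling division:

```python
import math

CHUNK_SIZE = 64 * 1024  # 64KB, the chunk size quoted above

def chunk_count(size_bytes: int) -> int:
    """Number of 64KB chunks needed to cover a file of the given size."""
    return math.ceil(size_bytes / CHUNK_SIZE)

# e.g. a 5GiB safetensors shard
print(chunk_count(5 * 1024**3))  # 81920
```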
Something I love about working at Hugging Face is the opportunity to design and work in public. Right now, we're redesigning the architecture that supports uploads and downloads on the Hub.
Datasets and models are growing fast, and so are the challenges of storing and transferring them efficiently. To keep up, we're introducing a new protocol for uploads and downloads, supported by a content-addressed store (CAS).
Here's what's coming:
📦 Smarter uploads: Chunk-level management enables advanced deduplication and compression, and reduces redundant transfers, speeding up uploads.
⚡ Efficient downloads: High throughput and low latency ensure fast access, even during high-demand model releases.
🔒 Enhanced security: Validate uploads before storage to block malicious or invalid data.
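The content-addressing idea behind a CAS can be sketched with a toy in-memory store (hypothetical names, not the actual implementation): chunks are keyed by their own hash, so duplicates collapse to one copy and data can be validated against its address.

```python
import hashlib

class ContentAddressedStore:
    """Toy content-addressed store: each chunk is keyed by its own SHA-256,
    so identical chunks are stored (and transferred) only once."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        # Deduplication: a key that's already present costs no new storage.
        self._blobs.setdefault(key, chunk)
        return key

    def get(self, key: str) -> bytes:
        chunk = self._blobs[key]
        # Validation on the read path: data must still match its address.
        assert hashlib.sha256(chunk).hexdigest() == key
        return chunk

store = ContentAddressedStore()
k1 = store.put(b"model weights, shard 1")
k2 = store.put(b"model weights, shard 1")  # duplicate upload
print(k1 == k2, len(store._blobs))  # True 1
```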
We analyzed 24 hours of global upload activity in October (88 countries, 130TB of data!) to design a system that scales with your needs.
The result? A proposed infrastructure with CAS nodes in us-east-1, eu-west-3, and ap-southeast-1.
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That's where our chunk-based approach comes in.
Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:
In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn't just a performance boost. It's a rethinking of how we manage models and datasets on the Hub.
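A toy content-defined chunker shows why this helps. Everything here is illustrative - the boundary rule, window size, and mask are made up, and real chunks average ~64KB rather than ~64 bytes - but the key property holds: because cut points depend only on local content, an edit near the start of a file changes a few chunks and leaves the rest byte-identical.

```python
def cdc_chunks(data: bytes, window: int = 8, mask: int = 0x3F) -> list[bytes]:
    """Toy content-defined chunker: cut after any position where a hash of
    the last `window` bytes matches a boundary pattern. Boundaries depend
    only on local content, not on fixed offsets."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        h = int.from_bytes(data[i - window + 1 : i + 1], "big")
        if (h & mask) == mask:  # boundary condition
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks

original = bytes(range(256)) * 64      # 16KB of sample data
edited = b"new header" + original      # edit: insert 10 bytes up front

a, b = cdc_chunks(original), cdc_chunks(edited)
# Only the first chunk differs; chunking re-synchronizes after the edit,
# so a second upload would transfer just one new chunk.
print(a[1:] == b[1:])  # True
```

With fixed-size chunks, the same 10-byte insertion would shift every boundary and force re-uploading the whole file.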
We're planning to roll out our new storage backend on the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?
The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co./xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.
Thanks to @yuchenglow and @port8080 (for their analysis covering LFS usage from March 2022 to Sept 2024), we now have insights into what we're storing. Check out the Gradio app to explore:
- Storage growth over time
- File types over all repositories
- Some simple optimizations we're investigating
We shut down XetHub today after almost 2 years. What we learned from launching our Git-scaled product from scratch:
- Don't make me change my workflow
- Data inertia is real
- ML best practices are still evolving
Closing the door on our public product lets us focus on our new goal of scaling HF Hub's storage backend to improve devX for a larger community. We'd love to hear your thoughts on what experiences we can improve!
We did a thing! Eight weeks into our Hugging Face tenure, we can demo a round-trip of Xet-backed files from our local machine to a prod Hugging Face S3 bucket and back. 🎉
It's been exciting to dive into how the Hub is built and design our steel thread through the infrastructure. Now that the thread is up, we can kick off project Capacious Extremis 🪄 to add all the other goodies: authentication, authorization, deduplication, privacy, and more.
What does this mean for you? You're one step closer to ⚡ faster downloads, uploads, and iterative development on the Hugging Face Hub! ✨ This is our first step toward replacing Git LFS as the Hub's storage backend: https://huggingface.co./blog/xethub-joins-hf
In August, the XetHub team joined Hugging Face - https://huggingface.co./blog/xethub-joins-hf - and we've been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.
Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.