Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.
To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:
- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)
And because I'm interested in how this will fit in our work to leverage content-defined chunking for versioning repos on the Hub - https://huggingface.co./blog/from-files-to-chunks - everything has the number of chunks (1 chunk = 64KB) as well as the total size in bytes.
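To make the chunk numbers concrete, here is a minimal sketch of the arithmetic. It assumes "64KB" means 64 KiB (65,536 bytes) and that a partial trailing chunk counts as one chunk. The function name is hypothetical and the Space itself may compute this differently. Because content-defined chunks vary in size, treat the result as an estimate.

```python
CHUNK_SIZE_BYTES = 64 * 1024  # assumption: "64KB" read as 64 KiB


def estimated_chunks(size_bytes: int) -> int:
    """Estimate how many 64 KiB chunks a file of size_bytes occupies."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    # Integer ceiling division: a partial trailing chunk still counts as one.
    return (size_bytes + CHUNK_SIZE_BYTES - 1) // CHUNK_SIZE_BYTES


if __name__ == "__main__":
    five_gib = 5 * 1024**3
    print(f"A 5 GiB file is roughly {estimated_chunks(five_gib)} chunks")
```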
Something I love about working at Hugging Face is the opportunity to design and work in public. Right now, we’re redesigning the architecture that supports uploads and downloads on the Hub.
Datasets and models are growing fast, and so are the challenges of storing and transferring them efficiently. To keep up, we're introducing a new protocol for uploads and downloads, supported by a content-addressed store (CAS).
Here’s what’s coming:
📦 Smarter uploads: Chunk-level management enables advanced deduplication and compression and cuts redundant transfers, speeding up uploads.
⚡ Efficient downloads: High throughput and low latency ensure fast access, even during high-demand model releases.
🔒 Enhanced security: Validate uploads before storage to block malicious or invalid data.
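For a feel of why a content-addressed store helps with redundant transfers, here is a toy in-memory sketch. It is an illustration under assumptions, not the Hub's protocol. The class and function names are hypothetical, SHA-256 is just a convenient choice of hash, and the chunks are supplied by hand rather than produced by a real chunker.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store: every chunk is keyed by the SHA-256 of its bytes."""

    def __init__(self) -> None:
        self._chunks = {}

    def has(self, key: str) -> bool:
        return key in self._chunks

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        self._chunks.setdefault(key, chunk)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        return self._chunks[key]


def upload(store: ContentAddressedStore, chunks: list) -> tuple:
    """Send only chunks the store lacks; return (manifest, bytes_sent)."""
    manifest, bytes_sent = [], 0
    for chunk in chunks:
        key = hashlib.sha256(chunk).hexdigest()
        if not store.has(key):
            store.put(chunk)
            bytes_sent += len(chunk)
        manifest.append(key)  # the file is described as an ordered list of keys
    return manifest, bytes_sent


def download(store: ContentAddressedStore, manifest: list) -> bytes:
    """Rebuild a file by concatenating its chunks in manifest order."""
    return b"".join(store.get(key) for key in manifest)
```

In this sketch, uploading a second version that differs in one chunk only sends that chunk, while both versions remain reconstructible from their manifests.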
We analyzed 24 hours of global upload activity in October (88 countries, 130TB of data!) to design a system that scales with your needs.
The result? A proposed infrastructure with CAS nodes in us-east-1, eu-west-3, and ap-southeast-1.
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.
Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:
⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks.
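Here is a minimal sketch of content-defined chunking to show why variable-sized chunks survive an edit. It is not the chunker used on the Hub. It assumes a simple gear-style rolling hash with a randomly generated table, a tiny ~1 KiB average chunk size so the demo runs quickly, and no minimum or maximum chunk size, which production chunkers normally enforce. All names are hypothetical.

```python
import random

MASK_BITS = 10                     # cut when the low 10 hash bits are zero (~1 KiB average)
MASK = (1 << MASK_BITS) - 1
_table_rng = random.Random(42)     # fixed seed so the gear table is reproducible
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data: bytes) -> list:
    """Split data wherever the rolling hash of the most recent bytes hits the mask."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting pushes old bytes out of the low bits, so a cut decision
        # depends only on the last MASK_BITS bytes, not on file offsets.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def new_bytes(old: bytes, new: bytes) -> int:
    """Bytes of `new` that fall in chunks not already present in `old`."""
    known = set(cdc_chunks(old))
    return sum(len(c) for c in cdc_chunks(new) if c not in known)


data_rng = random.Random(0)
original = bytes(data_rng.getrandbits(8) for _ in range(100_000))
edited = b"a small header added up front" + original
print(f"{new_bytes(original, edited)} of {len(edited)} bytes are in chunks not seen before")
```

With fixed-size chunks, prepending a few bytes would shift every boundary and change every chunk. Here the boundaries follow the content, so only the chunks near the edit are new.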
In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.
We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?
The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co./xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.
Thanks to @yuchenglow and @port8080 (for their analysis covering LFS usage from March 2022–Sept 2024), we now have insights into what we’re storing. Check out the Gradio app to explore:
- Storage growth over time
- File types over all repositories
- Some simple optimizations we're investigating
We shut down XetHub today after almost 2 years. What we learned from launching our Git-scaled product from scratch:
- Don't make me change my workflow
- Data inertia is real
- ML best practices are still evolving
Closing the door on our public product lets us focus on our new goal of scaling HF Hub's storage backend to improve devX for a larger community. We'd love to hear your thoughts on what experiences we can improve!
We did a thing! Eight weeks into our Hugging Face tenure, we can demo a round-trip of Xet-backed files from our local machine to a prod Hugging Face S3 bucket and back. 🚀
It’s been exciting to dive into how the Hub is built and design our steel thread through the infrastructure. Now that the thread is up, we can kick off project Capacious Extremis 🪄 to add all the other goodies: authentication, authorization, deduplication, privacy, and more.
What does this mean for you? You’re one step closer to ⚡ faster downloads, uploads, and iterative development on Hugging Face Hub! This is our first step toward replacing Git LFS as the Hub's storage backend: https://huggingface.co./blog/xethub-joins-hf
In August, the XetHub team joined Hugging Face - https://huggingface.co./blog/xethub-joins-hf - and we’ve been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.
Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.