Xet Team

xet-team's activity

jsulz posted an update about 2 months ago
Doing a lot of benchmarking and visualization work, which means I'm always searching for interesting repos in terms of file types, size, branches, and overall structure.

To help, I built a Space jsulz/repo-info that lets you search for any repo and get back:

- Treemap of the repository, color coded by file/directory size
- Repo branches and their size
- Cumulative size of different file types (e.g., the total size of all the safetensors in the repo)

And because I'm interested in how this fits into our work on leveraging content-defined chunking to version repos on the Hub (https://huggingface.co./blog/from-files-to-chunks), everything is reported both as a total size in bytes and as a number of chunks (1 chunk = 64KB).

Some of the treemaps are pretty cool. Attached are black-forest-labs/FLUX.1-dev and, for fun, laion/laion-audio-preview (which has nearly 10k .tar files 🤯)
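
For anyone who wants to pull similar numbers without the Space, here's a rough sketch using huggingface_hub: list a repo's files with their sizes, total them up by extension, and estimate chunk counts at 64KB per chunk. This isn't the Space's actual code, and the repo id below is just a public example to swap out.

```python
# Sketch: aggregate per-file-type sizes for a Hub repo and estimate 64KB chunk counts.
# Not the code behind jsulz/repo-info -- just an illustration with huggingface_hub.
from collections import defaultdict
from pathlib import PurePosixPath

from huggingface_hub import HfApi

CHUNK_SIZE = 64 * 1024  # 64KB, the chunk size referenced above
REPO_ID = "openai-community/gpt2"  # swap in any repo you want to inspect

api = HfApi()
info = api.model_info(REPO_ID, files_metadata=True)

totals = defaultdict(lambda: {"bytes": 0, "chunks": 0})
for sibling in info.siblings:
    if sibling.size is None:
        continue
    ext = PurePosixPath(sibling.rfilename).suffix or "(no extension)"
    totals[ext]["bytes"] += sibling.size
    # Ceiling division: a partial trailing chunk still counts as one chunk.
    totals[ext]["chunks"] += -(-sibling.size // CHUNK_SIZE)

for ext, t in sorted(totals.items(), key=lambda kv: kv[1]["bytes"], reverse=True):
    print(f"{ext:15s} {t['bytes'] / 1e9:8.2f} GB  ~{t['chunks']:,} chunks")
```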

jsulz posted an update about 2 months ago
Something I love about working at Hugging Face is the opportunity to design and work in public. Right now, we’re redesigning the architecture that supports uploads and downloads on the Hub.

Datasets and models are growing fast, and so are the challenges of storing and transferring them efficiently. To keep up, we're introducing a new protocol for uploads and downloads, supported by a content-addressed store (CAS).

Here’s what’s coming:

📦 Smarter uploads: Chunk-level management enables advanced deduplication and compression, reducing redundant transfers and speeding up uploads.
⚡ Efficient downloads: High throughput and low latency ensure fast access, even during high-demand model releases.
🔒 Enhanced security: Validate uploads before storage to block malicious or invalid data.
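
To make the chunk-level idea concrete, here's a toy sketch of a deduplicated upload against a content-addressed store. The ChunkStore below is invented for illustration (it is not the Hub's or CAS's real API), and it uses fixed 64KB chunks where the real system uses content-defined boundaries, but it shows the core mechanic: address chunks by hash and only transfer what the store doesn't already have.

```python
# Toy sketch of chunk-level deduplicated upload against a content-addressed store.
# The ChunkStore interface is hypothetical -- not the Hub's or CAS's actual API.
import hashlib
import os
from typing import Iterable


def fixed_chunks(data: bytes, chunk_size: int = 64 * 1024) -> Iterable[bytes]:
    """Split data into fixed 64KB chunks (the real system uses content-defined boundaries)."""
    for i in range(0, len(data), chunk_size):
        yield data[i : i + chunk_size]


class ChunkStore:
    """In-memory content-addressed store keyed by SHA-256 of each chunk."""

    def __init__(self) -> None:
        self._chunks: dict[str, bytes] = {}

    def missing(self, hashes: list[str]) -> set[str]:
        return {h for h in hashes if h not in self._chunks}

    def put(self, chunk: bytes) -> None:
        self._chunks[hashlib.sha256(chunk).hexdigest()] = chunk


def upload(data: bytes, store: ChunkStore) -> tuple[int, int]:
    """Upload only chunks the store does not already have; return (chunks sent, chunks total)."""
    chunks = list(fixed_chunks(data))
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    to_send = store.missing(hashes)
    for c, h in zip(chunks, hashes):
        if h in to_send:
            store.put(c)
    return len(to_send), len(chunks)


store = ChunkStore()
v1 = os.urandom(1024 * 1024)                      # 1 MB file, version 1
v2 = v1[: 512 * 1024] + os.urandom(512 * 1024)    # version 2: second half changed
print(upload(v1, store))  # (16, 16) -- everything is new
print(upload(v2, store))  # (8, 16)  -- only the changed half is transferred
```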

We analyzed 24 hours of global upload activity in October (88 countries, 130TB of data!) to design a system that scales with your needs.

The result? A proposed infrastructure with CAS nodes in us-east-1, eu-west-3, and ap-southeast-1.

🔗 Read the blog post for the full details: https://huggingface.co./blog/rearchitecting-uploads-and-downloads

🌟 Check out our interactive demo to explore the data yourself!
xet-team/cas-analysis

We’d love to hear your feedback - let us know if you have questions or want to see more.
jsulz posted an update 2 months ago
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your files as deduplicated chunks.

In our benchmarks, we found that using content-defined chunking (CDC) to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn't just a performance boost. It's a rethinking of how we manage models and datasets on the Hub.
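
For a sense of why chunk boundaries survive edits, here's a minimal content-defined chunking sketch built on a gear-style rolling hash. This is not the chunker we use on the Hub, and the mask and size bounds are toy parameters, but it shows how an insertion only disturbs the chunks around the edit while the rest of the file hashes to the same chunks as before.

```python
# Illustrative content-defined chunking with a gear-style rolling hash.
# Toy parameters only -- not the Hub's actual chunker.
import os
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random values
MASK = (1 << 13) - 1           # boundary when the low 13 hash bits are zero (~8KB average chunk)
MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024


def cdc_chunks(data: bytes):
    """Yield variable-sized chunks whose boundaries depend on content, not byte offsets."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start : i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


original = os.urandom(256 * 1024)
edited = original[:100_000] + b"NEW BYTES" + original[100_000:]  # small insertion mid-file

old_chunks = set(cdc_chunks(original))
new_chunks = list(cdc_chunks(edited))
reused = sum(1 for c in new_chunks if c in old_chunks)
print(f"{reused}/{len(new_chunks)} chunks of the edited file are unchanged and need no re-upload")
```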

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co./blog/from-files-to-chunks
erinys posted an update 3 months ago
jsulz posted an update 4 months ago
The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co./xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.

Thanks to @yuchenglow and @port8080 for their analysis covering LFS usage from March 2022 to Sept 2024, we now have insights into what we're storing. Check out the Gradio app to explore:
- Storage growth over time
- File types over all repositories
- Some simple optimizations we're investigating

xet-team/lfs-analysis
erinys posted an update 4 months ago
We shut down XetHub today after almost 2 years. What we learned from launching our Git-scaled product from scratch:
- Don't make me change my workflow
- Data inertia is real
- ML best practices are still evolving

Closing the door on our public product lets us focus on our new goal of scaling HF Hub's storage backend to improve devX for a larger community. We'd love to hear your thoughts on what experiences we can improve!

Read the full post: https://xethub.com/blog/shutting-down-xethub-learnings-and-takeaways
erinys posted an update 4 months ago
We did a thing! Eight weeks into our Hugging Face tenure, we can demo a round-trip of Xet-backed files from our local machine to a prod Hugging Face S3 bucket and back. 🚀

It’s been exciting to dive into how the Hub is built and design our steel thread through the infrastructure. Now that the thread is up, we can kick off project Capacious Extremis 🪄 to add all the other goodies: authentication, authorization, deduplication, privacy, and more.

What does this mean for you? You’re one step closer to ⚡ faster downloads, uploads, and iterative development on Hugging Face Hub!
This is our first step toward replacing Git LFS as the Hub's storage backend: https://huggingface.co./blog/xethub-joins-hf

Check out the demo on LinkedIn to see the transfer in action: https://www.linkedin.com/posts/annux_youve-heard-of-blue-steel-but-have-activity-7245062126535405568-3cvJ
jsulz posted an update 4 months ago
In August, the XetHub team joined Hugging Face (https://huggingface.co./blog/xethub-joins-hf), and we’ve been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.

Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.

You can read more about the findings (with some jaw-dropping stats + charts) here: https://www.linkedin.com/feed/update/urn:li:activity:7244486280351285248